The StatQuest Illustrated Guide to Machine Learning!!! (v2.1)
TRIPLE BAM!!!
www.statquest.org
Thank you for buying
The StatQuest Illustrated Guide to Machine Learning!!!
Every penny of your purchase goes to supporting StatQuest and helps create new
videos, books, and webinars. Thanks to you, StatQuest videos and webinars are free on
YouTube for everyone in the whole wide world to enjoy.
NOTE: If you’re borrowing this copy of The StatQuest Illustrated Guide to Machine
Learning from a friend, feel free to enjoy it. However, if you find it helpful, consider
buying your own copy or supporting StatQuest however you can at
https://statquest.org/support-statquest/.
For F and B, for teaching me to think differently,
T and D, for making it possible for me to think differently,
and for A, for everything.
Since every StatQuest starts with a Silly Song…
(Sheet music: ♪ The Stat-Quest Il-lus-tra-ted Guide is here!!! StatQuest!!! Hooray!!! ♪)
Hello!!! I’m Josh Starmer, and welcome to The
StatQuest Illustrated Guide to Machine
Learning!!! In this book, we’ll talk about
everything, from the very basics to advanced
topics like Neural Networks. All concepts
will be clearly illustrated, and we’ll go
through them one step at a time.
Table of Contents
01 Fundamental Concepts in Machine Learning!!! 8
02 Cross Validation!!! 21
03 Fundamental Concepts in Statistics!!! 30
04 Linear Regression!!! 75
05 Gradient Descent!!! 83
06 Logistic Regression!!! 108
07 Naive Bayes!!! 120
08 Assessing Model Performance!!! 136
09 Preventing Overfitting with Regularization!!! 164
10 Decision Trees!!! 183
11 Support Vector Classifiers and Machines (SVMs)!!! 218
12 Neural Networks!!! 234
Appendices!!! 271
1 Hey Normalsaurus, can you summarize all of machine learning in a single sentence?

Sure thing StatSquatch! Machine Learning (ML) is a collection of tools and techniques that transforms data into (hopefully good) decisions by making classifications, like whether or not someone will love a movie, or quantitative predictions, like how tall someone is.

3 Throughout each page, you'll see circled numbers like these…

BAM!

Hey! That's me in the corner!!!
Chapter 01
Fundamental Concepts in Machine Learning!!!
Machine Learning: Main Ideas
1 Hey Normalsaurus, can you summarize all of machine learning in a single sentence?

Sure thing StatSquatch! Machine Learning (ML) is a collection of tools and techniques that transforms data into (hopefully good) decisions by making classifications, like whether or not someone will love a movie, or quantitative predictions, like how tall someone is.
3 So, let's get started by talking about the main ideas of how machine learning is used for Classification.

BAM!
Machine Learning Classification: Main Ideas
(Figure: a Classification Tree with Yes/No branches about Silly Songs and machine learning.)

d If you're not interested in machine learning and don't like Silly Songs, then bummer! :(

e On the other hand, if you like Silly Songs, then the Classification Tree predicts that you will like StatQuest!!!

f And if you are interested in machine learning, then the Classification Tree predicts that you will like StatQuest!!!
Machine Learning Regression: Main Ideas
1 The Problem: We have another pile of data, and we want to use it to make quantitative predictions, which means that we want to use machine learning to do Regression.
Comparing Machine Learning Methods:
Main Ideas
1 The Problem: As we'll learn in this book, machine learning consists of a lot of different methods that allow us to make Classifications or Quantitative Predictions. How do we choose which one to use?

(Figure: two plots of Height vs. Weight, one fit with a black line and one with a green squiggle.)

…or we could use this green squiggle to predict Height from Weight. How do we decide to use the black line or the green squiggle?

For example, given this person's Weight… …the black line predicts this Height.
We can compare those two methods… BAM!!!
Now that we understand the
Main Ideas of how to compare
machine learning methods, let’s
get a better sense of how we do
this in practice.
Comparing Machine Learning Methods:
Intuition Part 1
1 The original data that we use to observe the trend and fit the line is called Training Data. In other words, the black line is fit to the Training Data.

2 Alternatively, we could have fit a green squiggle to the Training Data. The green squiggle fits the Training Data better than the black line, but remember: the goal of machine learning is to make predictions, so we need a way to determine if the black line or the green squiggle makes better predictions.

3 So, we collect more data, called Testing Data…

I sure would, StatSquatch! So, from this point forward, look for the dreaded Terminology Alert!!!
Comparing Machine Learning Methods:
Intuition Part 2
4 Now, if these blue dots are the Testing Data…

5 …then we can compare their Observed Heights to the Heights Predicted by the black line and the green squiggle…

…and we can measure the distance, or error, between the Observed and Predicted Heights.

(Figure: for the black line, the errors are added up: First Error + … = Total Error.)
Comparing Machine Learning Methods:
Intuition Part 3
10 Likewise, we can measure the distances, or errors, between the Observed Heights and the Heights Predicted by the green squiggle.

(Figure: the green squiggle's errors are added up the same way, First Error + … = Total Error, and compared to the Total Black Line Errors.)

And we see that the sum of the errors for the black line is shorter, suggesting that it did a better job making predictions.

13 In other words, even though the green squiggle fit the Training Data way better than the black line… 14 …the black line did a better job predicting Height with the Testing Data.
Comparing Machine Learning Methods:
Intuition Part 4
15 So, if we had to choose between using the black line or the green squiggle to make predictions… 16 …we would choose the black line because it makes better predictions.

BAM!!!
The example we just went through illustrates 2 Main Ideas about machine learning. First, we use Testing Data to evaluate machine learning methods. Second, just because a machine learning method fits the Training Data well doesn't mean it will perform well with the Testing Data.
TERMINOLOGY ALERT!!!
When a machine learning method fits the Training Data really well but makes poor predictions, we say that it is Overfit to the Training Data. Overfitting a machine learning method is related to something called the Bias-Variance Tradeoff, and we'll talk more about that later.
The Main Ideas of Machine Learning:
Summary
1 Now, you may be wondering why we started this book with a super simple Decision Tree… …and a simple black line and a silly green squiggle…

BAM!!!
…organize the data in a nice table:

Weight   Height
1.2      1.9
1.9      1.7
2.0      2.8
2.8      2.3

Now, regardless of whether we look at the data in the graph or in the table, we can see that Height depends on Weight, so we call Height a Dependent Variable. In contrast, because we're not predicting Weight, and thus, Weight does not depend on Height, we call Weight an Independent Variable. Alternatively, Weight can be called a Feature.
Terminology Alert!!! Discrete and
Continuous Data
1 Discrete Data… …is countable and only takes specific values.

2 For example, we can count the number of people that love the color green or love the color blue: 4 people love green and 3 people love blue. Because we are counting individual people, and the totals can only be whole numbers, the data are Discrete.

3 American shoe sizes are Discrete because even though there are half sizes, like 8 1/2, there's nothing in between the sizes that shoes come in.
5 Continuous Data… …is measurable and can take any numeric value within a range. For example, Height measurements, like 181.73 cm and 152.11 cm, are Continuous.

7 NOTE: If we get a more precise ruler… …then the measurements get more precise. So the precision of Continuous measurements is only limited by the tools we use.
Now we know about different types of data and how they can be used for Training and Testing. What's next?
Chapter 02
Cross Validation!!!
Cross Validation: Main Ideas
1 The Problem: So far, we've simply been told which points are the Training Data… …and which points are the Testing Data. However, usually no one tells us what is for Training and what is for Testing. How do we pick the best points?

A Solution: Rather than worry too much about which specific points are best for Training and best for Testing, Cross Validation uses all points for both in an iterative way, meaning that we use them in steps.

(Figure: three iterations, each holding out a different group of points: Testing set #1, Testing set #2, Testing set #3.)

BAM!!!
Cross Validation: Details Part 1
1 Imagine we gathered these 6 pairs of Weight and Height measurements…

TERMINOLOGY ALERT!!!
Reusing the same data for Training and Testing is called Data Leakage, and it usually results in you believing the machine learning method will perform better than it actually does because it is Overfit.
Cross Validation: Details Part 2
4 A slightly better idea is to randomly select some data to use only for Testing and use the rest for Training.

Gentle Reminder: These are the original 3 groups: Group 1, Group 2, and Group 3.

9 So, these are the 3 iterations of Training… 10 …and these are the 3 iterations of Testing.

NOTE: Because each iteration uses a different combination of data for Training, each iteration results in a slightly different fitted line.

We can average these errors to get a general sense of how well this model will perform with future data…

(Figure: in iteration #2, we train on Groups 1 and 3 and test on Group 2.)
Cross Validation: Details Part 4
11 For example, we could use 3-Fold Cross Validation to compare the errors from the black line to the errors from the green squiggle.

Gentle Reminder: These are the original 3 groups: Group 1, Group 2, and Group 3.

12 In each iteration we do the Training… 13 …and the Testing. (Figure: for each iteration, the Total Green Squiggle Error is compared to the Total Black Line Error; iteration #1, for example, trains on Groups 2 and 3 and tests on Group 1.)

The errors from 3-Fold Cross Validation show that the black line does a better job making predictions than the green squiggle.

By using Cross Validation, we can be more confident that the black line will perform better with new data without having to worry about whether or not we selected the best data for Training and the best data for Testing.
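If you'd like to see 3-Fold Cross Validation on a computer, here's a minimal Python sketch. It assumes scikit-learn is installed, the 6 (Weight, Height) pairs are made up, and a degree-4 polynomial stands in for the green squiggle; none of these specifics come from the book.

```python
# A minimal 3-Fold Cross Validation sketch (assumptions: scikit-learn is
# available, the data are made up, and a degree-4 polynomial stands in
# for the "green squiggle").
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

weight = np.array([[0.5], [1.2], [1.9], [2.0], [2.3], [2.9]])
height = np.array([1.4, 1.2, 1.7, 2.8, 1.9, 3.2])

black_line = LinearRegression()
green_squiggle = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for name, model in [("black line", black_line), ("green squiggle", green_squiggle)]:
    # cross_val_score trains on 2 groups and tests on the 3rd, 3 times over
    scores = cross_val_score(model, weight, height, cv=kf,
                             scoring="neg_mean_squared_error")
    print(name, "average testing MSE:", -scores.mean())
```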
Cross Validation: Details Part 5
(Figure: 10-Fold Cross Validation divides the data into 10 blocks, numbered 1 through 10.)

Then, we Train using the first 9 blocks… …and Test using the tenth block.

DOUBLE BAM!!!
Cross Validation: Details Part 6
15 Another commonly used form of Cross Validation is called Leave-One-Out.

(Figure: in each iteration, #1, #2, and so on, a single point is left out for Testing and the rest are used for Training.)
Chapter 03
Fundamental Concepts in Statistics!!!
Statistics: Main Ideas
1 The Problem: The world is an interesting place, and things are not always the same. For example, every time we order french fries, we don't always get the exact same number of fries. YUM!!! vs. YUM!

2 A Solution: Statistics provides us with a set of tools to quantify the variation that we find in everything and, for the purposes of machine learning, helps us make predictions and quantify how confident we should be in those predictions.
Histograms: Main Ideas
1 The Problem: We have a lot of measurements and want to gain insights into their hidden trends. For example, imagine we measured the Heights of so many people that the data, represented by green dots, overlap, and some green dots are completely hidden.

We could try to make it easier to see the hidden measurements by stacking any that are exactly the same… …but measurements that are exactly the same are rare, and a lot of the green dots are still hidden. :(

Alternatively, we can divide the range of values into bins and stack the measurements that fall into the same bin… …and we end up with a histogram!!!

BAM!!!
Histograms: Details

1 The taller the stack within a bin, the more measurements we made that fall into that bin.

In Chapter 7, we'll use histograms to make classifications using a machine learning algorithm called Naive Bayes. GET EXCITED!!!

BAM!
Histograms: Calculating Probabilities
Step-by-Step
1 If we want to estimate the probability that the next measurement will be in this red box… …we count the number of measurements, or observations, in the box and get 12… …and divide by the total number of measurements, 19…

12 / 19 = 0.63
Histograms: Calculating Probabilities
Step-by-Step
3 To estimate the probability that the next measurement will be in a red box that spans all of the data… …we count the number of measurements in the box, 19… …and divide by the total number of measurements, 19…

19 / 19 = 1
3 This blue, bell-shaped curve tells us the same types of things that the histogram tells us. For example, the relatively large amount of area under the curve in this red box tells us that there's a relatively high probability that we will measure someone whose value falls in this region.

4 Now, even though we never measured someone whose value fell in this range… 5 …we can use the area under the curve to estimate the probability of measuring a value in this range.

NOTE: Because we have both Discrete and Continuous data, there are both Discrete and Continuous Probability Distributions…
3 One of the most commonly used Discrete Probability Distributions is the Binomial Distribution. As you can see, it's a mathematical equation, so it doesn't depend on collecting tons of data, but, at least to StatSquatch, it looks super scary!!!

p(x | n, p) = [n! / (x!(n − x)!)] p^x (1 − p)^(n−x)

The Binomial Distribution makes me want to run away and hide.
The Binomial Distribution: Main Ideas Part 1
1 First, let's imagine we're walking down the street in StatLand and we ask the first 3 people we meet if they prefer pumpkin pie or blueberry pie…

NOTE: 0.147 is the probability of observing that the first two people prefer pumpkin pie and the third person prefers blueberry… …it is not the probability that 2 out of 3 people prefer pumpkin pie. Let's find out why on the next page!!!
The Binomial Distribution: Main Ideas Part 2
4 It could have just as easily been the case that the first person said they prefer blueberry and the last two said they prefer pumpkin. In this case, we would multiply the numbers together in a different order, but the probability would still be 0.147 (see Appendix A for details).

0.3 x 0.7 x 0.7 = 0.147
+ 0.7 x 0.3 x 0.7 = 0.147
+ 0.7 x 0.7 x 0.3 = 0.147
= 0.441

For example, if we wanted to calculate the probability of observing that 2 out of 4 people prefer pumpkin pie, we would have to calculate and sum the individual probabilities from 6 different arrangements… …and there are 10 ways to arrange 3 out of 5 people who prefer pumpkin pie. UGH!!! Drawing all of these slices of delicious pie is super tedious. :(

Luckily, the equation for the Binomial Distribution does the counting for us:

p(x | n, p) = [n! / (x!(n − x)!)] p^x (1 − p)^(n−x)

BAM!!!

In the next pages, we'll use the Binomial Distribution to calculate the probabilities of pie preference among 3 people, but it works in any situation that has binary outcomes, like wins and losses, yeses and noes, or successes and failures.

Now that we understand why the equation for the Binomial Distribution is so useful, let's walk through, one step at a time, how the equation calculates the probability of observing 2 out of 3 people who prefer pumpkin pie.
The Binomial Distribution: Details Part 1
1 First, let's focus on just the left-hand side of the equation.

p(x | n, p) = [n! / (x!(n − x)!)] p^x (1 − p)^(n−x)

YUM! I love pie!!!

0.3 x 0.7 x 0.7 = 0.147
+ 0.7 x 0.3 x 0.7 = 0.147
+ 0.7 x 0.7 x 0.3 = 0.147
= 0.441
The Binomial Distribution: Details Part 2
3 Now, let's look at the first part on the right-hand side of the equation. StatSquatch says it looks scary because it has factorials (the exclamation points; see below for details), but it's not that bad.

Despite the factorials, the first term simply represents the number of different ways we can meet 3 people, 2 of whom prefer pumpkin pie… …and, as we saw earlier, there are 3 different ways that 2 out of 3 people we meet can prefer pumpkin pie.

p(x | n, p) = [n! / (x!(n − x)!)] p^x (1 − p)^(n−x)

4 When we plug in x = 2, the number of people who prefer pumpkin pie… …and n = 3, the number of people we asked, and then do the math… …we get 3, the same number we got when we did everything by hand.

n! / (x!(n - x)!) = 3! / (2!(3 - 2)!) = 3! / (2!(1)!) = (3 x 2 x 1) / (2 x 1 x 1) = 3
The Binomial Distribution: Details Part 3
5 Now let's look at the second term on the right-hand side.

The second term is just the probability that 2 out of the 3 people prefer pumpkin pie. In other words, since p, the probability that someone prefers pumpkin pie, is 0.7… …and there are x = 2 people who prefer pumpkin pie, the second term = 0.7^2 = 0.7 x 0.7.

6 The third and final term is the probability that 1 out of 3 people prefers blueberry pie… …because if p is the probability that someone prefers pumpkin pie, (1 - p) is the probability that someone prefers blueberry pie… …and if x is the number of people who prefer pumpkin pie and n is the total number of people we asked, then n - x is the number of people who prefer blueberry pie.

p(x | n, p) = [n! / (x!(n − x)!)] p^x (1 − p)^(n−x)

Just so you know, sometimes people let q = (1 - p), and use q in the formula instead of (1 - p).
The Binomial Distribution: Details Part 4
7 Now that we've looked at each part of the equation for the Binomial Distribution, let's put everything together and solve for the probability that 2 out of 3 people we meet prefer pumpkin pie.

p(x = 2 | n = 3, p = 0.7) = [3! / (2!(3 − 2)!)] 0.7^2 (1 − 0.7)^(3−2)

…then we just do the math…

p(x = 2 | n = 3, p = 0.7) = 3 × 0.7^2 × 0.3^1
p(x = 2 | n = 3, p = 0.7) = 3 × 0.7 × 0.7 × 0.3
p(x = 2 | n = 3, p = 0.7) = 0.441

(Psst! Remember: the first term is the number of ways we can arrange the pie preferences, the second term is the probability that 2 people prefer pumpkin pie, and the last term is the probability that 1 person prefers blueberry pie.)

Gentle Reminder: We're using the equation for the Binomial Distribution to calculate the probability that 2 out of 3 people prefer pumpkin pie… …and the result is 0.441, which is the same value we got when we drew pictures of the slices of pie.

0.3 x 0.7 x 0.7 = 0.147
+ 0.7 x 0.3 x 0.7 = 0.147
+ 0.7 x 0.7 x 0.3 = 0.147
= 0.441

TRIPLE BAM!!!
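If you'd like to double-check the pie math on a computer, here's a tiny Python sketch; SciPy is an assumption, but the numbers come straight from this example.

```python
# p(x = 2 | n = 3, p = 0.7), by hand and with SciPy.
from math import comb
from scipy.stats import binom

n, x, p = 3, 2, 0.7

# comb(n, x) computes n! / (x!(n - x)!), the number of arrangements
by_hand = comb(n, x) * p**x * (1 - p)**(n - x)
print(by_hand)              # 0.441
print(binom.pmf(x, n, p))   # 0.441, the same value
```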
Hey Norm, I sort of understand the Binomial Distribution. Is there another commonly used Discrete Distribution that you think I should know about?
The Poisson Distribution: Details
1 So far, we've seen how the Binomial Distribution gives us probabilities for sequences of binary outcomes, like 2 out of 3 people preferring pumpkin pie, but there are lots of other Discrete Probability Distributions for lots of different situations. For example, when events happen at a known average rate, λ (lambda), the Poisson Distribution gives us the probability of observing x events:

p(x | λ) = e^(−λ) λ^x / x!

To summarize, we've seen that Discrete Probability Distributions can be derived from histograms… …and while these can be useful, they require a lot of data that can be expensive and time-consuming to get, and it's not always clear what to do about the blank spaces. Alternatively, we can use mathematical equations, like the ones for the Binomial and Poisson Distributions.

4 There are lots of other Discrete Probability Distributions for lots of other types of data. In general, their equations look intimidating, but looks are deceiving. Once you know what each symbol means, you just plug in the numbers and do the math.

BAM!!!

Now let's talk about Continuous Probability Distributions.
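Here's a quick Python sketch of plugging numbers into the Poisson equation; SciPy and the example numbers (λ = 3, x = 5) are my assumptions, not from the book.

```python
# p(x | lambda) = e^(-lambda) * lambda^x / x!, by hand and with SciPy.
from math import exp, factorial
from scipy.stats import poisson

lam, x = 3.0, 5   # made-up numbers: an average rate of 3, asking about 5 events

by_hand = exp(-lam) * lam**x / factorial(x)
print(by_hand)               # ~0.1008
print(poisson.pmf(x, lam))   # the same value
```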
Continuous Probability Distributions:
Main Ideas
1 The Problem: Although they can be super useful, beyond needing a lot of data, histograms have two problems when it comes to continuous data: we have to decide how wide to make the bins, and there can be gaps with no measurements in them.
The Normal (Gaussian) Distribution:
Main Ideas Part 1
1 Chances are you've seen a Normal or Gaussian distribution before. It's also called a Bell-Shaped Curve because it's a symmetrical curve… …that looks like a bell. A Normal distribution is symmetrical about the mean, or average, value. In this example, the curve represents human Height measurements.

(Figure: a bell-shaped curve; the y-axis goes from Less Likely to More Likely and the x-axis goes from Shorter to Taller, with the peak at the Average Height.)
The Normal (Gaussian) Distribution:
Main Ideas Part 2
4 The width of a Normal Distribution is defined by the standard deviation. In this example, the standard deviation for infants, 1.5, is smaller than the standard deviation for adults, 10.2.

BAM!!!

To draw a Normal Distribution, you need to know:
1) The Mean or average measurement. This tells you where the center of the curve goes.
2) The Standard Deviation of the measurements. This tells you how tall and skinny, or short and fat, the curve should be.

If you don't already know about the Mean and Standard Deviation, check out Appendix B.

Hey Norm, can you tell me what Gauss was like? 'Squatch, he was a normal guy!!!
The Normal (Gaussian) Distribution: Details
1 The equation for the Normal Distribution looks scary, but, just like every other equation, it's just a matter of plugging in numbers and doing the math.

f(x | μ, σ) = [1 / √(2πσ²)] e^(−(x−μ)² / (2σ²))

2 To see how the equation for the Normal Distribution works, let's calculate the likelihood (the y-axis coordinate) for an infant that is 50 cm tall. Since the mean of the distribution is also 50 cm, we'll calculate the y-axis coordinate for the highest part of the curve.

3 x is the x-axis coordinate. So, in this example, the x-axis represents Height and x = 50. The Greek character μ, mu, represents the mean of the distribution. In this case, μ = 50. Lastly, the Greek character σ, sigma, represents the standard deviation of the distribution. In this case, σ = 1.5.

f(x = 50 | μ = 50, σ = 1.5) = [1 / √(2π × 1.5²)] e^(−(50−50)² / (2 × 1.5²))

Now, we just do the math… …and we see that the likelihood, the y-axis coordinate, for an infant that is 50 cm tall is 0.27. Remember, the output from the equation, the y-axis coordinate, is a likelihood, not a probability.
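If you want to check the math with a computer, here's a minimal Python sketch using the numbers from this page (SciPy is an assumption):

```python
# The infant example: f(x = 50 | mu = 50, sigma = 1.5), by hand and with SciPy.
from math import exp, pi, sqrt
from scipy.stats import norm

x, mu, sigma = 50, 50, 1.5

by_hand = (1 / sqrt(2 * pi * sigma**2)) * exp(-(x - mu)**2 / (2 * sigma**2))
print(by_hand)                  # ~0.27, a likelihood (y-axis coordinate)
print(norm.pdf(x, mu, sigma))   # the same value
```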
Calculating Probabilities with Continuous
Probability Distributions: Details
1 For Continuous Probability Distributions, probabilities are the area under the curve between two points. For example, given this Normal Distribution with mean = 155.7 and standard deviation = 6.6, the probability of getting a measurement between 142.5 and 155.7 cm… …is equal to this area under the curve, which in this example is 0.48. So, the probability is 0.48 that we will measure someone in this range.

In theory, areas under the curve come from integrals, like ∫ f(x) dx, but UGH!!! NO ONE ACTUALLY DOES THIS!!! In practice, we let a computer calculate the area (see Appendix C for a list of commands). Also, the total area under the curve is 1, meaning the probability of measuring anything in the range of possible values is 1.

BAM!!!

4 One confusing thing about Continuous Distributions is that the likelihood for a specific measurement, like 155.7, is the y-axis coordinate and > 0… …but the probability for a specific measurement is always 0. One way to understand why the probability is 0 is to remember that probabilities are areas, and the area of something with no width is 0.
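In Python, for example, the 0.48 area can come from subtracting two values of the Normal CDF; here's a minimal sketch with this page's numbers (SciPy assumed):

```python
# The probability of a measurement between 142.5 and 155.7 cm is the
# area under the curve, computed as a difference of two CDF values.
from scipy.stats import norm

mu, sigma = 155.7, 6.6
area = norm.cdf(155.7, mu, sigma) - norm.cdf(142.5, mu, sigma)
print(area)   # ~0.48
```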
Other Continuous Distributions: Main Ideas
1 Exponential Distributions are commonly used when we're interested in how much time passes between events. For example, we could measure how many minutes pass between page turns in this book.

(Figure: an Exponential Distribution; the likelihood is highest near 0 and tapers off as the time between events grows.)

2 …and, additionally, Continuous Distributions also spare us from having to decide how to bin the data.

3 Instead, Continuous Distributions use equations that represent smooth curves and can provide likelihoods and probabilities for all possible measurements. For example, the Normal Distribution:

f(x | μ, σ) = [1 / √(2πσ²)] e^(−(x−μ)² / (2σ²))

4 Like Discrete Distributions, there are Continuous Distributions for all kinds of data, like the values we get from measuring people's height or timing how long it takes you to read this page.

In the context of machine learning, both types of distributions allow us to create Models that can predict what will happen next. So, let's talk about what Models are and how to use them.

(small but mighty) BAM!!!
Models: Main Ideas Part 1
1 The Problem: Although we could spend a lot of time and money to build a precise histogram…

2 A Solution: A statistical, mathematical, or machine learning Model provides an approximation of reality that we can use in a wide variety of ways.
Models: Main Ideas Part 2
3 As we saw in Chapter 1, models need Training Data. Using machine learning lingo, we say that we build models by training machine learning algorithms.

4 Models, or equations, can tell us about people we haven't measured yet. For example, if we want to know how tall someone is who weighs this much… …we plug the Weight into the equation and solve for Height…

Height = 0.5 + (0.8 x Weight)
Height = 0.5 + (0.8 x 2.1)
Height = 2.18

…and get 2.18.

5 Because models are only approximations, it's important that we're able to measure the quality of their predictions. These green lines show the distances from the model's predictions to the actual data points.

6 In summary…

Bam!
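As a tiny sketch, the model on this page is just a one-line Python function:

```python
# The fitted line as a model: plug in a Weight, get a predicted Height.
def predict_height(weight):
    return 0.5 + 0.8 * weight

print(round(predict_height(2.1), 2))   # 2.18, matching the worked example
```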
The Sum of the Squared Residuals:
Main Ideas Part 1
1 The Problem: We have a model that makes predictions. In this case, we're using Weight to predict Height. However, we need to quantify the quality of the model and its predictions.

2 A Solution: One way to quantify the quality of a model and its predictions is to calculate the Sum of the Squared Residuals. As the name implies, we start by calculating Residuals, the differences between the Observed values and the values Predicted by the model. Visually, we can draw Residuals with green lines between the data and the model.

SSR = Σ (Observedᵢ - Predictedᵢ)²

…where the Σ (sigma) tells us to sum the squared Residuals over each Observation, i. For example, i = 1 refers to the first Observation.

NOTE: Squaring, as opposed to taking the absolute value, makes it easy to take the derivative, which will come in handy when we do Gradient Descent in Chapter 5.
The Sum of the Squared Residuals:
Main Ideas Part 2
3 So far, we've looked at the SSR only in terms of a simple straight line model, but we can calculate it for all kinds of models, like a squiggly model of Rainfall over Time.

NOTE: We measure Residuals with vertical lines, not perpendicular lines, because, in this example, perpendicular lines would result in different Weights for the Observed and Predicted Heights. In contrast, the vertical distance allows both the Observed and Predicted Heights to correspond to the same Weight.

The Sum of Squared Residuals (SSR) = Σᵢ₌₁ⁿ (Observedᵢ - Predictedᵢ)²

For i = 3, the term for the third Observation is (2.9 - 2.2)².

Now, we just do the math, and the final Sum of Squared Residuals (SSR) is 0.69.

BAM!!!

Don't get me wrong, the SSR is awesome, but it has a pretty big problem that we'll talk about on the next page.
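Here's a minimal Python sketch of the SSR calculation. The book's figure only spells out the i = 3 term, (2.9 - 2.2)², so the other Observed and Predicted values below are made up to hit the same total of 0.69:

```python
# SSR = sum of (Observed - Predicted)^2 over all Observations.
observed  = [1.9, 1.6, 2.9]   # made-up values, except that the third
predicted = [1.7, 2.0, 2.2]   # pair gives the book's (2.9 - 2.2)^2 term

ssr = sum((o - p) ** 2 for o, p in zip(observed, predicted))
print(round(ssr, 2))   # 0.69
```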
Mean Squared Error (MSE): Main Ideas
1 The Problem: The Sum of the Squared Residuals (SSR), although awesome, is not super easy to interpret because it depends, in part, on how much data you have.

A Solution: The Mean Squared Error (MSE) is the average of the squared Residuals:

MSE = [Σᵢ₌₁ⁿ (Observedᵢ - Predictedᵢ)²] / n = SSR / n
Mean Squared Error (MSE): Step-by-Step
1 Now let's see the MSE in action by calculating it for two datasets!!!

2 The first dataset has only 3 points and the SSR = 14, so the Mean Squared Error (MSE) is 14/3 = 4.7. The second dataset has 5 points and the SSR increases to 22. In contrast, the MSE, 22/5 = 4.4, is now slightly lower.

MSE = SSR / n = 14 / 3 = 4.7        MSE = SSR / n = 22 / 5 = 4.4

So, unlike the SSR, which increases when we add more data to the model, the MSE can increase or decrease depending on the average residual, which gives us a better sense of how the model is performing overall. However, the MSE still depends on the scale of the data, so we can calculate something called R², which is independent of both the size of the dataset and the scale, so keep reading!
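As a quick sketch in Python, the MSE is just the SSR divided by the number of points:

```python
# MSE = SSR / n, using the two datasets from this page.
def mean_squared_error(ssr, n):
    return ssr / n

print(round(mean_squared_error(14, 3), 1))   # 4.7 for the 3-point dataset
print(round(mean_squared_error(22, 5), 1))   # 4.4 for the 5-point dataset
```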
R²: Main Ideas

1 The Problem: As we just saw, the MSE, although totally cool, can be difficult to interpret because it depends, in part, on the scale of the data. In this example, changing the units from millimeters to meters reduced the MSE by a lot: MSE = 4.7 in millimeters vs. MSE = 0.0000047 in meters!!!
R²: Details Part 1
1 First, we calculate the Sum of the Squared Residuals for the mean. We'll call this SSR the SSR(mean). In this example, the mean Height is 1.9 and the SSR(mean) = 1.6.

SSR(mean) = (2.3 - 1.9)² + (1.2 - 1.9)² + (2.7 - 1.9)² + (1.4 - 1.9)² + (2.2 - 1.9)² = 1.6

2 Then, we calculate the SSR for the fitted line, SSR(fitted line), and get 0.5.

SSR(fitted line) = (1.2 - 1.1)² + (2.2 - 1.8)² + (1.4 - 1.9)² + (2.7 - 2.4)² + (2.3 - 2.5)² = 0.5

NOTE: The smaller Residuals around the fitted line, and thus the smaller SSR given the same dataset, suggest the fitted line does a better job making predictions than the mean.

3 Now we can calculate the R² value using a surprisingly simple formula…

R² = [SSR(mean) - SSR(fitted line)] / SSR(mean) = (1.6 - 0.5) / 1.6 = 0.7

…and the result, 0.7, tells us that there was a 70% reduction in the size of the Residuals between the mean and the fitted line.

4 In general, because the numerator for R², SSR(mean) - SSR(fitted line), is the amount by which the SSRs shrank when we fitted the line, R² values tell us the percentage the Residuals around the mean shrank when we used the fitted line. When the fitted line fits the data perfectly, the Residuals around the fitted line, and thus SSR(fitted line), will be 0, so R² = [SSR(mean) - 0] / SSR(mean) = 1.

7 In contrast, when we see a trend in a large amount of data like this… …we can, intuitively, have more confidence that a large R² is not due to random chance because the data are not scattered around randomly like we might expect.

NOTE: If we work through the algebra with MSEs instead of SSRs, we end up with R² times 1, which is just R². So, we can calculate R² with the SSR or MSE, whichever is readily available. Either way, we'll get the same value.
Can R² be negative? When we're only comparing the mean to a fitted line, R² is positive, but when we compare other types of models, anything can happen. For example, if we use R² to compare a straight line to a parabola… …we get a negative R² value, -1.4, and it tells us the Residuals increased by 140%.

R² = [SSR(line) - SSR(parabola)] / SSR(line) = (5 - 12) / 5 = -1.4
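Here's a small Python sketch of the R² formula, using the numbers from both examples on these pages:

```python
# R^2 = (SSR(simple model) - SSR(fancier model)) / SSR(simple model)
def r_squared(ssr_simple, ssr_fitted):
    return (ssr_simple - ssr_fitted) / ssr_simple

print(round(r_squared(1.6, 0.5), 2))   # ~0.69: the line shrank the residuals ~70%
print(r_squared(5, 12))                # -1.4: the parabola's residuals grew by 140%
```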
(Figure: the results of giving Drug A and Drug B to two groups of people.)

Now, it's pretty obvious that Drug A is different from Drug B because it would be unrealistic to suppose that these results were due to just random chance and that there's no real difference between Drug A and Drug B.

It's possible that some of the people taking Drug A were actually cured by placebo, and some of the people taking Drug B were not cured because they had a rare allergy, but there are just too many people cured by Drug A, and too few cured by Drug B, for us to seriously think that these results are just random and that Drug A is no different from Drug B.
p-values: Main Ideas Part 2
4 In contrast, let's say that these were the results… …and 37% of the people who took Drug A were cured compared to 31% who took Drug B.

Drug A cured a larger percentage of people, but given that no study is perfect and there are always a few random things that happen, how confident can we be that Drug A is different from Drug B?

This is where p-values come in. p-values are numbers between 0 and 1 that, in this example, quantify how confident we should be that Drug A is different from Drug B. The closer a p-value is to 0, the more confidence we have that Drug A and Drug B are different.

So, the question is, "how small does a p-value have to be before we're sufficiently confident that Drug A is different from Drug B?" In other words, what threshold can we use to make a good decision about whether these drugs are different?

TERMINOLOGY ALERT!!!
Getting a small p-value when there is no difference is called a False Positive.

A 0.05 threshold for p-values means that 5% of the experiments, where the only differences come from weird, random things, will generate a p-value smaller than 0.05. In other words, if there's no difference between Drug A and Drug B, in 5% of the times we do the experiment, we'll get a p-value less than 0.05, and that would be a False Positive.
p-values: Main Ideas Part 4
9 If it's extremely important that we're correct when we say the drugs are different, then we can use a smaller threshold, like 0.01 or 0.001 or even smaller. Using a threshold of 0.001 would get a False Positive only once in every 1,000 experiments.

Likewise, if it's not that important (for example, if we're trying to decide if the ice-cream truck will arrive on time), then we can use a larger threshold, like 0.2. Using a threshold of 0.2 means we're willing to get a False Positive 2 times out of 10.

That said, the most common threshold is 0.05 because trying to reduce the number of False Positives below 5% often costs more than it's worth.

TERMINOLOGY ALERT!!!
In fancy statistical lingo, the idea of trying to determine if these drugs are the same or not is called Hypothesis Testing. The Null Hypothesis is that the drugs are the same, and the p-value helps us decide if we should reject the Null Hypothesis.

BAM!!!
p-values: Main Ideas Part 5
11 While a small p-value helps us decide if Drug A is different from Drug B, it does not tell us how different they are. In other words, you can have a small p-value regardless of the size of the difference between Drug A and Drug B. The difference can be tiny or huge.

For example, this experiment gives us a relatively large p-value, 0.24, even though there's a 6-point difference between Drug A and Drug B:

Drug A: 73 Cured!!!, 125 Not Cured (37% Cured)
Drug B: 59 Cured!!!, 131 Not Cured (31% Cured)

In contrast, an experiment involving a lot more people gives us a smaller p-value, 0.04, even though there's only a 1-point difference between Drug A and Drug B.

DOUBLE BAM!!!

Now that we understand the main ideas of p-values, let's summarize the main ideas of this chapter.
The Fundamental Concepts of Statistics:
Summary
1 We can see trends in data with histograms. We'll learn how to use histograms to make classifications with Naive Bayes in Chapter 7.

2 However, histograms have limitations (they need a lot of data and can have gaps), so we also use probability distributions to represent trends. We'll learn how to use probability distributions to make classifications with Naive Bayes in Chapter 7.

3 Rather than collect all of the data in the whole wide world, which would take forever and be way too expensive, we use models to approximate reality. Histograms and probability distributions are examples of models that we can use to make predictions. We can also use a mathematical formula, like the equation for the blue line, as a model that makes predictions.
Chapter 04
Linear Regression!!!
Linear Regression: Main Ideas
1 The Problem: We've collected Weight and Height measurements from 5 people, and we want to use Weight to predict Height, which is continuous… …and in Chapter 3, we learned that we could fit a line to the data and use it to make predictions.

However, 1) we didn't talk about how we fit a line to the data, and 2) we didn't calculate a p-value for the fitted line, which would quantify how much confidence we should have in its predictions compared to just using the mean y-axis value.
Fitting a Line to Data: Main Ideas
1 Imagine we had Height and Weight data on a graph… …and we wanted to predict Height from Weight.

2 Because the heavier Weights are paired with taller Heights, this line makes terrible predictions. We can quantify how bad those predictions are with the Sum of the Squared Residuals (SSR). BAM!!!

Then we can plot the SSR on this graph that has the SSR on the y-axis and different lines fit to the data on the x-axis.
Fitting a Line to Data: Intuition
1 If we don't change the slope, we can see how the SSR changes for different y-axis intercept values… …and, in this case, the goal of Linear Regression would be to find the y-axis intercept that results in the lowest SSR, at the bottom of this curve.

SSR(fitted line) = 0.55        SSR(mean) = 1.61

R² = (1.61 - 0.55) / 1.61 = 0.66

2 The R² value, 0.66, suggests that using Weight to predict Height will be useful, but now we need to calculate the p-value to make sure that this result isn't due to random chance. In this context, a p-value tells us the probability that random data could result in a similar R² value or a better one. In other words, the p-value will tell us the probability that random data could result in an R² ≥ 0.66.
1 So far, the example we've used demonstrates something called Simple Linear Regression because we use one variable, Weight, to predict Height… …and, as we've seen, Simple Linear Regression fits a line to the data that we can use to make predictions.

If we use two variables, like Weight and Shoe Size, to predict Height, the graph gets one axis for Weight, one for Shoe Size… …one for Height… …and instead of fitting a line to the data, we fit a plane.
Beyond Linear Regression
1 As mentioned at the start of this chapter, Linear Regression is the gateway to something called Linear Models, which are incredibly flexible and powerful.

2 Linear Models allow us to use discrete data, like whether or not someone loves the movie Troll 2, to predict something continuous, like how many grams of Popcorn they eat each day.

4 Linear Models also easily combine discrete data, like whether or not someone loves Troll 2, with continuous data, like how much Soda Pop they drink, to predict something continuous, like how much Popcorn they will eat. In this case, adding how much Soda Pop someone drinks to the model dramatically increased the R² value (R² = 0.97), which means the predictions will be more accurate, and reduced the p-value (p-value = 0.006), suggesting we can have more confidence in the predictions.

DOUBLE BAM!!!

5 If you'd like to learn more about Linear Models, scan, click, or tap this QR code to check out the 'Quests on YouTube!!!

bam.
Chapter 05
Gradient Descent!!!
Gradient Descent: Main Ideas
1 The Problem: A major part of machine learning is optimizing a model's fit to the data. Sometimes this can be done with an analytical solution, but it's not always possible.

A Solution: When there's no analytical solution, Gradient Descent can optimize the fit iteratively.

BAM!!!
Gradient Descent: Details Part 1
1 Let's show how Gradient Descent fits a line to these Height and Weight measurements.
Gradient Descent: Details Part 2
4 In this example, we're fitting a line to data, and we can evaluate how well that line fits with the Sum of the Squared Residuals (SSR). Remember, Residuals are the difference between the Observed and Predicted values. The Observed Heights are the values we originally measured… …and the Predicted Heights come from the equation for the line…
Gradient Descent: Details Part 3
6 In this first example, since we're only optimizing the y-axis intercept, we'll start by assigning it a random value. In this case, we'll initialize the intercept by setting it to 0.

7 Now, to calculate the SSR, we first plug the value for the y-axis intercept, 0, into the equation we derived in Steps 4 and 5…

SSR = (Observed Height₁ - (intercept + 0.64 x Weight₁))²

TERMINOLOGY ALERT!!!
The terms Loss Function and Cost Function refer to anything we want to optimize when we fit a model to data. For example, we might want to optimize the SSR or the Mean Squared Error (MSE) when we fit a straight line with Regression or a squiggly line (in Neural Networks). That said, some people use the term Loss Function to specifically refer to a function (like the SSR) applied to only one data point, and use the term Cost Function to specifically refer to a function (like the SSR) applied to all of the data. Unfortunately, these specific meanings are not universal, so be aware of the context and be prepared to be flexible. In this book, we'll use them together and interchangeably, as in "The Loss or Cost Function is the SSR."
Gradient Descent: Details Part 5
12 Now, when we started with the y-axis intercept = 0, we got this SSR… …so how do we take steps toward the y-axis intercept that gives us the lowest SSR?
Gradient Descent: Details Part 6
16 Because a single term of the SSR consists of a Residual… …wrapped in parentheses and squared… …one way to take the derivative of the SSR is to use The Chain Rule (see Appendix F if you need to refresh your memory about how The Chain Rule works).

SSR = (Height - (intercept + 0.64 x Weight))²

d SSR / d intercept = -2 x (Height₁ - (intercept + 0.64 x Weight₁))
                    + -2 x (Height₂ - (intercept + 0.64 x Weight₂)) + …

Gentle Reminder: Because we're using Linear Regression as our example, we don't actually need to use Gradient Descent to find the optimal value for the intercept. Instead, we could just set the derivative equal to 0 and solve for the intercept. This would be an analytical solution. However, by applying Gradient Descent to this problem, we can compare the optimal value that it gives us to the analytical solution and evaluate how well Gradient Descent performs. This will give us more confidence in Gradient Descent when we use it in situations without analytical solutions, like Logistic Regression and Neural Networks.
Terminology Alert!!! Parameters
OH NO!!! MORE TERMINOLOGY!!!
Gradient Descent for One Parameter:
Step-by-Step
1 First, plug the Observed values into the derivative of the Loss or Cost Function. In this example, the SSR is the Loss or Cost Function… …so that means plugging the Observed Weight and Height measurements into the derivative of the SSR:

d SSR / d intercept = … + -2 x (Height - (intercept + 0.64 x Weight))
Gradient Descent for One Parameter:
Step-by-Step
3 Now evaluate the derivative at the current value for the intercept. In this case, the current value is 0. When we do the math, we get -5.7… …thus, when the intercept = 0, the slope of this tangent line is -5.7.

Step Size = Derivative x Learning Rate
Step Size = -2.3 x 0.1
Step Size = -0.23

NOTE: The Step Size is smaller than before because the slope of the tangent line is not as steep as before. The smaller slope means we're getting closer to the optimal value.
Gradient Descent for One Parameter:
Step-by-Step
7 After 7 iterations of Gradient Descent… …the Step Size was very close to 0, so we stopped with the current intercept = 0.95… …and we made it to the lowest SSR.

BAM???

Not yet! Now let's see how well Gradient Descent optimizes the intercept and the slope!
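Here's a minimal Python sketch of this loop. It assumes the slope is fixed at 0.64 and uses the three (Weight, Height) pairs that appear later in this chapter; the stopping rule (quit when the Step Size gets tiny) is my own choice for the sketch:

```python
# Gradient Descent for the intercept only (slope fixed at 0.64).
weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]

slope, intercept, learning_rate = 0.64, 0.0, 0.1

for iteration in range(1000):
    # d(SSR)/d(intercept) = sum of -2 * (Observed - (intercept + slope * Weight))
    derivative = sum(-2 * (h - (intercept + slope * w))
                     for w, h in zip(weights, heights))
    step_size = derivative * learning_rate   # first pass: -5.7 * 0.1
    intercept = intercept - step_size
    if abs(step_size) < 0.001:               # stop when the steps get tiny
        break

print(round(intercept, 2))   # ~0.95, the value Gradient Descent settles on
```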
Optimizing Two or More Parameters: Details
1 Now that we know how to optimize the intercept of the line that minimizes the SSR, let's optimize both the intercept and the slope.

(Figure: a 3-D graph of the SSR, where one axis represents different values for the slope and another represents different values for the intercept.)

d SSR / d slope = -2 x Weight₁ x (Height₁ - (intercept + slope x Weight₁))
                + -2 x Weight₂ x (Height₂ - (intercept + slope x Weight₂))
                + -2 x Weight₃ x (Height₃ - (intercept + slope x Weight₃))

…and when we plug in the three observed (Weight, Height) pairs, (2.9, 3.2), (2.3, 1.9), and (0.5, 1.4), we get:

d SSR / d slope = -2 x 2.9 x (3.2 - (intercept + slope x 2.9))
                + -2 x 2.3 x (1.9 - (intercept + slope x 2.3))
                + -2 x 0.5 x (1.4 - (intercept + slope x 0.5))
Gradient Descent for Two Parameters:
Step-by-Step
Gentle Reminder: the derivative of the SSR with respect to the slope is

d SSR / d slope = -2 x 2.9 x (3.2 - (intercept + slope x 2.9))
                + -2 x 2.3 x (1.9 - (intercept + slope x 2.3))
                + -2 x 0.5 x (1.4 - (intercept + slope x 0.5))

…and we plug in the current values for the intercept, 0, and the slope, 0.5:

d SSR / d slope = -2 x 2.9 x (3.2 - (0 + 0.5 x 2.9))
                + -2 x 2.3 x (1.9 - (0 + 0.5 x 2.3))
                + -2 x 0.5 x (1.4 - (0 + 0.5 x 0.5))
Gradient Descent for Two Parameters:
Step-by-Step
3 Evaluate the derivatives at the current values for the intercept, 0, and slope, 0.5. For the slope, when we do the math, we get:

d SSR / d slope = -2 x 2.9 x (3.2 - (0 + 0.5 x 2.9))
                + -2 x 2.3 x (1.9 - (0 + 0.5 x 2.3))
                + -2 x 0.5 x (1.4 - (0 + 0.5 x 0.5)) = -14.8

4 Calculate the Step Sizes: one for the intercept and one for the slope.

Then we keep looping: a) evaluate the derivatives at their current values… b) calculate the Step Sizes… c) calculate the new values…

(Figure: Gradient Descent's path down a 3-D surface, where one axis is for the SSR and another axis represents different values for the intercept.)

TRIPLE BAM!!!

Gradient Descent is awesome, but when we have a lot of data or a lot of parameters, it can be slow. Is there any way to make it faster?
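And here's a sketch of the same loop updating both parameters at once, using the same three data points; the smaller Learning Rate (0.01) and the stopping threshold are assumptions I've made so the sketch converges smoothly:

```python
# Gradient Descent for the intercept and the slope together.
weights = [2.9, 2.3, 0.5]
heights = [3.2, 1.9, 1.4]

intercept, slope, learning_rate = 0.0, 0.5, 0.01

for iteration in range(100000):
    d_intercept = sum(-2 * (h - (intercept + slope * w))
                      for w, h in zip(weights, heights))
    d_slope = sum(-2 * w * (h - (intercept + slope * w))
                  for w, h in zip(weights, heights))
    intercept -= d_intercept * learning_rate
    slope -= d_slope * learning_rate
    if max(abs(d_intercept), abs(d_slope)) * learning_rate < 0.00001:
        break

print(round(intercept, 2), round(slope, 2))   # ~0.95 and ~0.64
```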
Stochastic Gradient Descent: Main Ideas
1 So far, things have been pretty simple. We started out with a tiny dataset that only had 3 points… …and we fit a straight line, which only has 2 parameters, the intercept and the slope.

2 Because there were only 2 parameters, we only had 2 derivatives that we had to calculate in each step: d SSR / d intercept and d SSR / d slope.

8 TERMINOLOGY ALERT!!!
Although a strict definition of Stochastic Gradient Descent says that we only select a single point per iteration, it's much more common to randomly select a small subset of the observations. This is called Mini-Batch Stochastic Gradient Descent. Using a small subset, rather than a single point, usually converges on the optimal values in fewer steps and takes much less time than using all of the data.
Gradient Descent: FAQ
Will Gradient Descent always find the best parameter values? Unfortunately, no. When this happens (when we get stuck in a local minimum instead of finding the global minimum), it's a bummer. Even worse, usually it's not possible to graph the SSR, so we might not even know we're in one, of potentially many, local minimums. However, there are a few things we can do about it. We can: …
Chapter 06
Logistic Regression!!!
Logistic Regression: Main Ideas Part 1
…but what if we want to classify something discrete that only has two possibilities, like whether or not someone loves the movie Troll 2?

BAM!!!

One thing that is confusing about Logistic Regression is the name: it is used for Classification, not regression!!!
Logistic Regression: Main Ideas Part 2
3 The y-axis on a Logistic Regression graph represents probability and goes from 0 to 1: from 0 = Does Not Love Troll 2 up to 1 = Loves Troll 2. In this case, it's the probability that someone loves Troll 2. The colored dots are the Training Data, which we used to fit the squiggle, and the x-axis is the amount of Popcorn (g) each person ate. The data consist of 4 people who Do Not Love Troll 2 and 5 people who Love Troll 2.

4 If someone new comes along and tells us that they ate this much Popcorn… …then the squiggle tells us that there's a relatively high probability that that person will love Troll 2. Specifically, the corresponding y-axis value on the squiggle tells us that the probability that this person loves Troll 2 is 0.96.

DOUBLE BAM!!!
Logistic Regression: Main Ideas Part 3
5 Now that we know the probability that this person will love Troll 2, we can classify them as someone who either Loves Troll 2 or Does Not Love Troll 2. Usually the threshold for classification is 0.5… …and in this case, that means anyone with a probability of loving Troll 2 > 0.5 will be classified as someone who Loves Troll 2… …and anyone with a probability ≤ 0.5 will be classified as someone who Does Not Love Troll 2.

7 One last thing before we go: In this example, the classification threshold was 50%. However, when we talk about Receiver Operator Curves (ROCs) in Chapter 8, we'll see examples that use different classification thresholds. So get excited!!!

TRIPLE BAM!!!

In a few pages, we'll talk about how we fit a squiggle to Training Data. However, before we do that, we need to learn some fancy terminology.
Terminology Alert!!! Probability vs. Likelihood:
Part 1
Hey Norm, don't the words Probability and Likelihood mean the same thing?

(Figure: a Normal curve over Heights, from Shorter to Taller, with the peak at the Average Height.)
Terminology Alert!!! Probability vs. Likelihood:
Part 2
2 In contrast, later in Chapter 3, we saw that Probabilities are derived from a Normal Distribution by calculating the area under the curve between two points, for example, between 142.5 cm and 155.7 cm.

I think I understand Probabilities, but what do we do with Likelihoods?
Terminology Alert!!! Probability vs. Likelihood:
Part 3
4 Likelihoods are often used to evaluate how well a statistical distribution fits a dataset. For example, imagine we collected these three Height measurements… …and we wanted to compare the fit of a Normal Curve that has its peak to the right of the data… …to the fit of a Normal Curve that is centered over the data.

Fitting A Squiggle To Data: Main Ideas Part 1

3 For example, to calculate the Likelihood for this person, who Loves Troll 2, we use the squiggle to find the y-axis coordinate that corresponds to the amount of Popcorn they ate… …and this y-axis coordinate, 0.4, is both the predicted probability that they Love Troll 2 and the Likelihood.

4 Likewise, the Likelihood for this person, who also Loves Troll 2, is the y-axis coordinate for the squiggle, 0.6, that corresponds to the amount of Popcorn they ate.
Fitting A Squiggle To Data: Main Ideas Part 2
The good news is, because someone either loves Troll 2 or they don't, the probability that someone does not love Troll 2 is just 1 minus the probability that they love Troll 2:

p(Does Not Love Troll 2) = 1 - p(Loves Troll 2)

So, for a person who Does Not Love Troll 2… …we first calculate the Likelihood that they Love Troll 2, 0.8… …and then use that value to calculate the Likelihood that they Do Not Love Troll 2 = 1 - 0.8 = 0.2.

Bam!
Fitting A Squiggle To Data: Main Ideas Part 3
7 Now that we know how to calculate the Likelihoods for people who Love Troll 2 and people who Do Not Love Troll 2, we can calculate the Likelihood for the entire squiggle by multiplying the individual Likelihoods together… …and when we do the math, we get 0.02.

8 Now we calculate the Likelihood for a different squiggle… …and compare the total Likelihoods for both squiggles:

0.1 x 0.2 x 0.6 x 0.7 x 0.9 x 0.9 x 0.9 x 0.9 x 0.8 = 0.004

In practice, we usually find the optimal squiggle using Gradient Descent.

TRIPLE BAM!!!
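Here's a small Python sketch of this calculation, using the nine Likelihoods shown for the second squiggle; the log version at the end previews the trick used in practice to avoid multiplying lots of tiny numbers:

```python
# The total Likelihood of a squiggle = the product of the individual Likelihoods.
from math import exp, log, prod

likelihoods = [0.1, 0.2, 0.6, 0.7, 0.9, 0.9, 0.9, 0.9, 0.8]

print(round(prod(likelihoods), 3))                 # ~0.004
log_likelihood = sum(log(l) for l in likelihoods)  # sums of logs don't underflow
print(round(exp(log_likelihood), 3))               # the same ~0.004
```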
Fitting A Squiggle To Data: Details
1 The example we've used so far has a relatively small Training Dataset of only 9 points total… …so when we multiplied the 9 Likelihoods together, it was easy, and we got 0.02.

3 …Regression to fit an s-shaped squiggle to the data, we would get a horrible model that misclassified everyone who ate a lot of Popcorn as someone who loves Troll 2.
Chapter 07
Naive Bayes!!!
Naive Bayes: Main Ideas
p(N) x p(Dear | N) x p(Friend | N)
p(S) x p(Dear | S) x p(Friend | S)

5 Then we compare that to the Prior Probability, another initial guess, that the message is Spam… …multiplied by the probabilities of seeing the words Dear and Friend, given that the message is Spam.
Multinomial Naive Bayes: Details Part 1
1 There are several types of Naive Bayes algorithms, but the most commonly used version is called Multinomial Naive Bayes.

p(Dear | N) = 0.47

Then we do the same thing for all the other words in the Normal messages:

p(Friend | N) = 0.29
p(Lunch | N) = 0.18
p(Money | N) = 0.06
Multinomial Naive Bayes: Details Part 2
6 Then we calculate the probabilities of seeing each word, Dear, Friend, Lunch, and Money, given that they came from Spam messages:

p(Dear | S) = 0.29

8 We also make initial guesses, called Prior Probabilities, about whether a message is Normal or Spam, estimated from the Training Data:

p(N) = # of Normal Messages / Total # of Messages = 8 / 12 = 0.67
p(S) = # of Spam Messages / Total # of Messages = 4 / 12 = 0.33

11 Then we calculate the overall score that the message, Dear Friend, is Normal… …by multiplying the Prior Probability that the message is Normal… …by the probabilities of seeing the words Dear and Friend in Normal messages… …and when we do the math, we get 0.09.

p(N) = 0.67, p(Dear | N) = 0.47, p(Friend | N) = 0.29

12 Likewise, using the Prior Probability for Spam messages… …and the probabilities for each word occurring in Spam messages… …we can calculate the overall score that the message, Dear Friend, is Spam, and when we do the math, we get 0.01.

p(S) = 0.33
p(Dear | S) = 0.29
p(Friend | S) = 0.14
p(Lunch | S) = 0.00
p(Money | S) = 0.57
Multinomial Naive Bayes: Details Part 4
13 Now, remember the goal was to classify the message Dear Friend as either Normal or Spam… …and we created histograms of the words in the messages… …then, using the Prior Probabilities and the probabilities for each word, given that it came from either a Normal message or Spam, we calculated scores for Dear Friend… …and now we can finally classify Dear Friend!!! Because the score for Normal (0.09) is greater than the score for Spam (0.01), we classify Dear Friend as a Normal message.

p(N) x p(Dear | N) x p(Friend | N) = 0.67 x 0.47 x 0.29 = 0.09
p(S) x p(Dear | S) x p(Friend | S) = 0.33 x 0.29 x 0.14 = 0.01

BAM!!!
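Here's a minimal Python sketch of the whole Dear Friend calculation, using the word probabilities and Prior Probabilities from the previous pages:

```python
# Multinomial Naive Bayes scores for the message "Dear Friend".
p_word_given_normal = {"dear": 0.47, "friend": 0.29, "lunch": 0.18, "money": 0.06}
p_word_given_spam   = {"dear": 0.29, "friend": 0.14, "lunch": 0.00, "money": 0.57}
prior_normal, prior_spam = 0.67, 0.33

message = ["dear", "friend"]

score_normal, score_spam = prior_normal, prior_spam
for word in message:
    score_normal *= p_word_given_normal[word]
    score_spam *= p_word_given_spam[word]

# 0.09 > 0.01, so we classify "Dear Friend" as Normal
print(round(score_normal, 2), round(score_spam, 2))
```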
Multinomial Naive Bayes: Details Part 5
14 The reason Naive Bayes is naive is that it's simple: it treats all of the words independently, and this means it ignores word order and phrasing. So, both Dear Friend and Friend Dear get the same score, 0.09.

16 For example, if we see this message… Money Money Money Money Lunch …it looks like it could be Spam because it has the word Money in it a bunch of times.

Gaussian Naive Bayes: Details Part 1

When the data are continuous, like these measurements of the Popcorn, Soda Pop, and Candy that people consume…

Popcorn (grams)   Soda Pop (ml)   Candy (grams)
24.3              750.7           0.2
28.2              533.2           50.5
etc.              etc.            etc.

…we fit a Normal curve to the measurements of each feature, for Popcorn, for Soda Pop… …and for Candy. (Figure: one of the curves has mean = 4 and sd = 2, another has mean = 24 and sd = 4.)
Gaussian Naive Bayes: Details Part 2
2 Now we calculate the Prior Probability that someone Loves Troll 2. Just like for Multinomial Naive Bayes, this Prior Probability is just a guess and can be any probability we want. However, we usually estimate it from the number of people in the Training Data:

p(Loves Troll 2) = # of People Who Love Troll 2 / Total # of People = 4 / (4 + 3) = 0.6
Gaussian Naive Bayes: Details Part 3
5 …we calculate the score for Loves Troll 2 by multiplying the Prior Probability that they Love Troll 2… …by the Likelihoods, the y-axis coordinates, that correspond to 20 grams of Popcorn, 500 ml of Soda Pop, and 100 grams of Candy, given that they Love Troll 2:

p(Loves Troll 2)
  x L(Popcorn = 20 | Loves)
  x L(Soda Pop = 500 | Loves)
  x L(Candy = 100 | Loves)

Because multiplying lots of small numbers like these can cause computational problems, we take the log, which turns the multiplication into addition:

log(0.6 x 0.06 x 0.004 x a tiny number) = log(0.6) + log(0.06) + log(0.004) + log(a tiny number)
Gaussian Naive Bayes: Details Part 4
8 Likewise, we calculate the score for Does Not Love Troll 2 by multiplying the Prior Probability…

p(No Love)
  x L(Popcorn = 20 | No Love)
  x L(Soda Pop = 500 | No Love)
  x L(Candy = 100 | No Love)

…then we take the log of each term, plug in the numbers, do the math, and get -48:

log(p(No Love))
  + log(L(Popcorn = 20 | No Love))
  + log(L(Soda Pop = 500 | No Love))
  + log(L(Candy = 100 | No Love))

9 Lastly, because the score for Does Not Love Troll 2 (-48) is greater than the score for Loves Troll 2 (-124), we classify this person… …as someone who Does Not Love Troll 2.

If my continuous data are not Gaussian, can I still use Gaussian Naive Bayes?

(Figure: a non-Gaussian distribution of Time Spent Reading, from 0 to 15.)

Silly Bam. :)
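Here's a Python sketch of the log-score calculation for one class. The structure follows this page, but the means and standard deviations for each Normal curve are made-up stand-ins, since the book reads the Likelihoods off its figures (SciPy assumed):

```python
# A Gaussian Naive Bayes log score for "Loves Troll 2".
from math import log
from scipy.stats import norm

# (mean, sd) of the Normal curve fit to each feature -- illustrative values only
loves = {"popcorn": (24.0, 4.0), "soda_pop": (600.0, 100.0), "candy": (90.0, 30.0)}
prior_loves = 0.6

person = {"popcorn": 20.0, "soda_pop": 500.0, "candy": 100.0}

# sum the logs instead of multiplying the tiny Likelihoods together
score = log(prior_loves)
for feature, value in person.items():
    mean, sd = loves[feature]
    score += log(norm.pdf(value, mean, sd))

print(score)   # the bigger (less negative) score wins the classification
```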
Naive Bayes: FAQ Part 2

What if some of my data are discrete and some of my data are continuous? Can I still use Naive Bayes? Yes! We can mix Naive Bayes techniques, using the Multinomial approach for the discrete data and the Gaussian approach for the continuous data.

TRIPLE SPAM!!!

What do you call it when we mix Naive Bayes techniques together? I don't know, Norm. Maybe "Naive Bayes Deluxe!"
Naive Bayes: FAQ Part 3
p( N | Dear Friend ) = [ p( N ) x p( Dear | N ) x p( Friend | N ) ] / [ p( N ) x p( Dear | N ) x p( Friend | N ) + p( S ) x p( Dear | S ) x p( Friend | S ) ]

p( S | Dear Friend ) = [ p( S ) x p( Dear | S ) x p( Friend | S ) ] / [ p( N ) x p( Dear | N ) x p( Friend | N ) + p( S ) x p( Dear | S ) x p( Friend | S ) ]
small bam.
Chapter 08
Assessing Model
Performance!!!
Assessing Model Performance: Main Ideas
1 The Problem: We’ve got this dataset, and we want to use it to predict if someone has Heart Disease, but we don’t know in advance which model will make the best predictions. Should we choose Naive Bayes… …or Logistic Regression?

Chest Pain | Good Blood Circ. | Blocked Arteries | Weight | Heart Disease
No | No | No | 125 | No
Yes | Yes | Yes | 180 | Yes
Yes | Yes | No | 210 | No
… | … | … | … | …
A Solution: we can compare the models using Confusion Matrices, simple grids that tell us what each model did right and what it got wrong. For example:

Predicted: Has Heart Disease | Does Not Have Heart Disease
Actual Has Heart Disease: 142 | 22

2 We start by dividing the data into two blocks…
…and we use the first block to optimize both models using Cross Validation.
3 Then we apply the optimized Naive Bayes model to the second block and build a Confusion Matrix that helps us keep track of 4 things: 1) the number of people with Heart Disease who were correctly predicted to have Heart Disease… …2) the number of people with Heart Disease who were incorrectly predicted to not have Heart Disease… …3) the number of people without Heart Disease who were correctly predicted to not have Heart Disease… …and 4) the number of people without Heart Disease who were incorrectly predicted to have Heart Disease.

Predicted: Has Heart Disease | Does Not Have Heart Disease
Actual Has Heart Disease: 142 | 22
Actual Does Not Have: 29 | 115
BAM!!
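Counting the 4 outcomes is easy to do in code, too. Here’s a minimal Python sketch with made-up labels:

# A minimal sketch of building a Confusion Matrix by counting the 4 outcomes.
actual    = ["Yes", "Yes", "No", "No", "Yes"]   # made-up example labels
predicted = ["Yes", "No",  "No", "Yes", "Yes"]

tp = fn = fp = tn = 0
for a, p in zip(actual, predicted):
    if a == "Yes" and p == "Yes":
        tp += 1    # 1) correctly predicted to have Heart Disease
    elif a == "Yes" and p == "No":
        fn += 1    # 2) incorrectly predicted to not have Heart Disease
    elif a == "No" and p == "No":
        tn += 1    # 3) correctly predicted to not have Heart Disease
    else:
        fp += 1    # 4) incorrectly predicted to have Heart Disease
print(tp, fn, fp, tn)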
The Confusion Matrix: Details Part 1
1 When the actual and predicted values are both YES, then we call that a True Positive… …and when the actual value is YES, but the predicted value is NO, then we call that a False Negative…

Agreed!
The Confusion Matrix: Details Part 2
2 When there are only two possible outcomes, like Yes and No, the Confusion Matrix has 2 rows and 2 columns…

3 …but when there are more options, like the movies Troll 2, Gore Police, and Cool as Ice, the Confusion Matrix gains a row and a column for each one.

4 In general, the size of the Confusion Matrix is determined by the number of things we want to predict. Bam.
The Confusion Matrix: Details Part 3
5 Unfortunately, there’s no standard for how a Confusion Matrix is oriented. In many cases, the rows reflect the actual, or known, classifications and the columns represent the predictions, but sometimes it’s the other way around. So, make sure you read the labels before you interpret a Confusion Matrix.
The Confusion Matrix: Details Part 4
Gentle Reminder: in these matrices, the rows are the Actual values and the columns are the Predicted values, so the four cells are the True Positives (TP), False Negatives (FN), False Positives (FP), and True Negatives (TN).

6 Lastly, in the original example, when we compared the Confusion Matrices for Naive Bayes and Logistic Regression… …because both matrices contained the same number of False Negatives (22 each) and False Positives (29 each)… …all we needed to do was compare the True Positives (142 vs. 137) to quantify how much better one model was at identifying people with Heart Disease.

However, when the two matrices differ in every cell, like these…

Naive Bayes: TP = 142, FN = 29, FP = 22, TN = 110
Logistic Regression: TP = 139, FN = 32, FP = 20, TN = 112

…we have to compare the True Positives (142 vs. 139) and False Negatives (29 vs. 32). And we still see that Logistic Regression is better at predicting people without Heart Disease, but quantifying how much better has to take True Negatives (110 vs. 112) and False Positives (22 vs. 20) into account.
1 When we want to quantify how well an algorithm correctly classifies the actual Positives, we calculate Sensitivity, the percentage of the actual Positives that were correctly classified:

Sensitivity = True Positives / (True Positives + False Negatives) = 142 / (142 + 29) = 0.83

…which means that 83% of the people with Heart Disease were correctly classified. BAM!!!
2 When we want to quantify how well an algorithm (like Logistic Regression) correctly classifies the actual Negatives, in this case, the known people without Heart Disease, we calculate Specificity, which is the percentage of the actual Negatives that were correctly classified:

Specificity = True Negatives / (True Negatives + False Positives)

For example, using the Heart Disease data and the Confusion Matrix for Logistic Regression (TP = 139, FN = 32, FP = 20, TN = 112), the Specificity is 0.85:

Specificity = TN / (TN + FP) = 112 / (112 + 20) = 0.85
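Both metrics are one-liners in code. Here’s a minimal Python sketch using the numbers above:

# A minimal sketch of Sensitivity and Specificity, using the book's numbers.
def sensitivity(tp, fn):
    return tp / (tp + fn)   # fraction of actual Positives classified correctly

def specificity(tn, fp):
    return tn / (tn + fp)   # fraction of actual Negatives classified correctly

print(round(sensitivity(tp=142, fn=29), 2))  # 0.83
print(round(specificity(tn=112, fp=20), 2))  # 0.85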
DOUBLE BAM!!! Now let’s talk about Precision and Recall.
Precision and Recall: Main Ideas
1 Precision is another metric that can summarize a Confusion Matrix. It tells us the percentage of the predicted Positive results (so, both True and False Positives) that were correctly classified:

Precision = True Positives / (True Positives + False Positives)

For example, using the Heart Disease data and the Confusion Matrix for Naive Bayes (TP = 142, FN = 29, FP = 22, TN = 110), the Precision is 142 / (142 + 22) = 0.87…

2 Recall is just another name for Sensitivity, which is the percentage of the actual Positives that were correctly classified. Why do we need two different names for the same thing? I don’t know.

Recall = Sensitivity = True Positives / (True Positives + False Negatives)
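And here are Precision and Recall as code, again with the book’s numbers:

# A minimal sketch of Precision and Recall (a.k.a. Sensitivity).
def precision(tp, fp):
    return tp / (tp + fp)   # fraction of predicted Positives that were correct

def recall(tp, fn):
    return tp / (tp + fn)   # the same formula as Sensitivity

print(round(precision(tp=142, fp=22), 2))  # 0.87
print(round(recall(tp=142, fn=29), 2))     # 0.83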
(Silly Song, in F# minor and G:) Pre-ci-sion is some-thing dif-fer-ent, it’s the per-cen-tage of pre-dic-ted po-si-tives cor-rect-ly pre-dic-ted, and re-call gets us back to the start, it’s the same as… Now go back to the top and repeat!!!
True Positive Rate and False Positive
Rate: Main Ideas
1 The True Positive Rate is the same thing as Recall, which is the same thing as Sensitivity. I kid you not. We have 3 names for the exact same thing: the percentage of actual Positives that were correctly classified. TRIPLE UGH!!!

True Positive Rate = Recall = Sensitivity = True Positives / (True Positives + False Negatives)

2 The False Positive Rate tells you the percentage of actual Negatives that were incorrectly classified as Positive:

False Positive Rate = False Positives / (False Positives + True Negatives)
3 Okay, now we’ve got some fancy terminology that we can use to summarize individual confusion matrices. BAM?

To be honest, even with the Silly Song, I find it hard to remember all of this fancy terminology, and so I don’t use it that much on its own. However, it’s a stepping stone to something that I use all the time and find super useful: ROC and Precision Recall Graphs, so let’s learn about those!
[Figure: the probability that someone Loves Troll 2 (y-axis, from 0 = Does Not Love Troll 2 to 1 = Loves Troll 2) plotted against Popcorn (g), with a classification threshold at 0.5 that gives this Confusion Matrix:]

Actual Yes: TP = 4, FN = 1
Actual No: FP = 1, TN = 3
ROC: Main Ideas Part 2
3 Now let’s talk about what happens when we use a different classification threshold for deciding if someone Loves Troll 2 or not. For example, if it was super important to correctly classify every single person who Loves Troll 2, we could set the threshold to 0.01.

NOTE: If the idea of using a classification threshold other than 0.5 is blowing your mind, imagine we’re trying to classify people with the Ebola virus. In this case, it’s absolutely essential that we correctly identify every single person with the virus to minimize the risk of an outbreak. And that means lowering the threshold, even if it results in more False Positives.
4 With the threshold lowered to 0.01, everyone who Loves Troll 2 is classified correctly… …and we would end up with this Confusion Matrix:

Actual Yes: TP = 5, FN = 0
Actual No: FP = 2, TN = 2

…and if we raised the threshold instead, we would end up with this Confusion Matrix:

Actual Yes: TP = 3, FN = 2
Actual No: FP = 0, TN = 4
ROC: Main Ideas Part 4
7 We could also set the classification threshold to 0 and classify everyone as someone who Loves Troll 2… …and that would give us this Confusion Matrix:

Actual Yes: TP = 5, FN = 0
Actual No: FP = 4, TN = 0

8 …or we could set the classification threshold to 1 and classify everyone as someone who Does Not Love Troll 2, which gives us this Confusion Matrix:

Actual Yes: TP = 0, FN = 5
Actual No: FP = 0, TN = 4
ROC: Main Ideas Part 5
9 Ultimately, we can try any classification threshold from 0 to 1…

Threshold = 0: TP 5, FN 0 / FP 4, TN 0
Threshold = 0.007: TP 5, FN 0 / FP 3, TN 1
Threshold = 0.0645: TP 5, FN 0 / FP 2, TN 2
Threshold = 0.195: TP 5, FN 0 / FP 1, TN 3
Threshold = 0.5: TP 4, FN 1 / FP 1, TN 3
Threshold = 0.87: TP 3, FN 2 / FP 1, TN 3
Threshold = 0.95: TP 3, FN 2 / FP 0, TN 4
Threshold = 0.965: TP 2, FN 3 / FP 0, TN 4
Threshold = 0.975: TP 1, FN 4 / FP 0, TN 4
Threshold = 1: TP 0, FN 5 / FP 0, TN 4

UGH!!! Trying to find the ideal classification threshold among all of these technicolor Confusion Matrices is tedious and annoying. Wouldn’t it be awesome if we could consolidate them into one easy-to-interpret graph?

YES!!! Well, we’re in luck because that’s what ROC graphs do!!!
ROC: Main Ideas Part 6
10 ROC graphs are super helpful when we need to identify a good classification threshold because they summarize how well each threshold performed in terms of the True Positive Rate and the False Positive Rate.

ROC stands for Receiver Operating Characteristic, and the name comes from the graphs drawn during World War II that summarized how well radar operators correctly identified airplanes in radar signals.
ROC: Details Part 1

1 To get a better sense of how an ROC graph works, let’s draw one from start to finish. We’ll start by using a classification threshold, 1, that classifies everyone as someone who Does Not Love Troll 2… …and when the classification threshold is set to 1, we end up with this Confusion Matrix:

Actual Yes: TP = 0, FN = 5
Actual No: FP = 0, TN = 4

2 …then we calculate the True Positive Rate…

True Positive Rate = True Positives / (True Positives + False Negatives) = 0 / (0 + 5) = 0

3 …and the False Positive Rate…

False Positive Rate = False Positives / (False Positives + True Negatives) = 0 / (0 + 4) = 0

4 …and then we can plot that point, (0, 0), on the ROC graph, which has the True Positive Rate (or Sensitivity or Recall) on the y-axis and the False Positive Rate (or 1 - Specificity) on the x-axis.
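Here’s a minimal Python sketch of the whole recipe: for each classification threshold, build the Confusion Matrix and turn it into one (False Positive Rate, True Positive Rate) point. The probabilities and labels are made up; they just stand in for the Troll 2 example:

# A minimal sketch of computing ROC points across several thresholds.
probs  = [0.01, 0.05, 0.2, 0.5, 0.6, 0.8, 0.9, 0.97, 0.98]
actual = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes"]

for threshold in [0.0, 0.25, 0.5, 0.75, 1.0]:
    predicted = ["Yes" if p > threshold else "No" for p in probs]
    tp = sum(a == "Yes" and q == "Yes" for a, q in zip(actual, predicted))
    fn = sum(a == "Yes" and q == "No" for a, q in zip(actual, predicted))
    fp = sum(a == "No" and q == "Yes" for a, q in zip(actual, predicted))
    tn = sum(a == "No" and q == "No" for a, q in zip(actual, predicted))
    tpr = tp / (tp + fn)   # True Positive Rate, the y-axis coordinate
    fpr = fp / (fp + tn)   # False Positive Rate, the x-axis coordinate
    print(threshold, (fpr, tpr))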
ROC: Details Part 2

5 Now let’s lower the classification threshold to 0.975… …which is just enough to classify one person as someone who Loves Troll 2, and everyone else is classified as someone who Does Not Love Troll 2…

6 …then we calculate the True Positive Rate…

True Positive Rate = 1 / (1 + 4) = 0.2

7 …and the False Positive Rate…

False Positive Rate = 0 / (0 + 4) = 0

8 …and then we can plot that point, (0, 0.2), on the ROC graph…
9 Then we lower the classification threshold to 0.965… …which is just enough to classify 2 people as people who Love Troll 2, and everyone else is classified as someone who Does Not Love Troll 2…

10 …then we calculate the True Positive Rate…

True Positive Rate = 2 / (2 + 3) = 0.4

11 …and the False Positive Rate…

False Positive Rate = 0 / (0 + 4) = 0

12 …and then we can plot that point, (0, 0.4), on the ROC graph…
13 We keep lowering the classification threshold, building the Confusion Matrix, and calculating the True Positive Rate and False Positive Rate for each one, until every threshold, from 1 all the way down to 0, has a point on the graph. [Figure: the ROC graph with one point for each threshold (1, 0.975, 0.965, 0.95, 0.87, 0.5, 0.195, 0.0645, 0.007, and 0), with the True Positive Rate (or Sensitivity or Recall) on the y-axis and the False Positive Rate (or 1 - Specificity) on the x-axis.]
ROC: Details Part 5
14 After we finish plotting the points from each possible Confusion Matrix, we usually connect the dots… …and add a diagonal line that tells us when the True Positive Rate = False Positive Rate. BAM!!!

16 ROC graphs are great for selecting an optimal classification threshold for a model. But what if we want to compare how one model performs vs. another? This is where the AUC, which stands for Area Under the Curve, comes in handy. So read on!!!
AUC: Main Ideas
However, if we wanted to compare a bunch of models, comparing their ROC graphs curve by curve would be just as tedious as comparing a bunch of Confusion Matrices. UGH!!! [Figure: two ROC graphs side by side, each with the True Positive Rate on the y-axis and the False Positive Rate on the x-axis.]
By summarizing each ROC curve with a single number, the AUC (for example, AUC = 0.85), we can compare models at a glance. BAM!!!

AUC: FAQ

How do you calculate the AUC? One way is to split the area under the ROC curve into triangles and rectangles and add up the pieces. For example:

Area of one triangle = 1/2 x 0.25 x 0.2 = 0.025
Area of another triangle = 1/2 x 0.25 x 0.2 = 0.025
Area of a rectangle = 0.5 x 1 = 0.5

[Figure: the ROC curve (y-axis ticks at 0.2, 0.4, 0.6, and 0.8) with the area under it divided into triangles and rectangles.]
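In code, adding up the rectangles and triangles between consecutive ROC points is the trapezoid rule. Here’s a minimal Python sketch with made-up (False Positive Rate, True Positive Rate) points:

# A minimal sketch of the AUC as the trapezoid rule over ROC points,
# ordered left to right along the x-axis.
roc_points = [(0.0, 0.0), (0.0, 0.4), (0.25, 0.8), (1.0, 1.0)]

auc = 0.0
for (x1, y1), (x2, y2) in zip(roc_points, roc_points[1:]):
    auc += (x2 - x1) * (y1 + y2) / 2   # rectangle + triangle between two points
print(auc)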
BUT WAIT!!!
THERE’S MORE!!!
Precision Recall Graphs: Main Ideas Part 1
1 An ROC graph uses the False Positive Rate on the x-axis, and this is fine when the data are balanced, which in this case means there are similar numbers of people who Love Troll 2 and who Do Not Love Troll 2.

2 However, when the data are imbalanced, and we have way more people who Do Not Love Troll 2 (which wouldn’t be surprising, since it consistently wins “worst movie ever” awards)… …the huge number of True Negatives keeps the False Positive Rate small, even for a model that makes a lot of False Positive predictions.
Precision Recall Graphs: Main Ideas Part 2
3 A Precision Recall graph simply replaces the False Positive Rate on the x-axis with Precision and renames the y-axis Recall, since Recall is the same thing as the True Positive Rate. Because Precision only uses the predicted Positives (True and False Positives), it is not swamped by a huge number of True Negatives. For example, with imbalanced data and Threshold = 0.5, a Confusion Matrix might contain just 1 False Positive next to 200 True Negatives.

Bam.
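Here’s a minimal Python sketch, with made-up counts, of why Precision is more informative than the False Positive Rate when the data are imbalanced:

# A minimal sketch: 1 False Positive looks harmless next to 200 True
# Negatives, but it's a third of the positive predictions.
tp, fn, fp, tn = 2, 3, 1, 200

fpr = fp / (fp + tn)          # about 0.005: looks great, thanks to the many TNs
precision = tp / (tp + fp)    # about 0.67: a third of positive predictions are wrong
print(fpr, precision)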
Hey Norm, now that we
understand how to summarize
how well a model performs using
Confusion Matrices and ROC
graphs, what should we learn
about next?
Chapter 09
Preventing
Overfitting with
Regularization!!!
Regularization: Main Ideas
1 The Problem: The fancier and more flexible a machine learning method is, the easier it is to overfit the Training Data.

2 A Solution: One very common way to deal with overfitting the Training Data is to use a collection of techniques called Regularization. Essentially, Regularization reduces how sensitive the model is to the Training Data. BAM!!!

NOTE: In this chapter, we’ll learn about the two main types of Regularization, Ridge and Lasso, in the context of Linear Regression, but they can be used with almost any machine learning algorithm to improve performance.
Ridge/Squared/L2 Regularization:
Details Part 1
1 Let’s imagine we measured the Height and Weight of 5 different people… 2 …and then we split those data into Training Data and Testing Data.

…where the Greek character λ, lambda, is a positive number that determines how strong an effect Ridge Regularization has on the new line.

9 To get a better idea of how the Ridge Penalty works, let’s plug in some numbers. We’ll start with the line that fits the Training Data perfectly…

Height = 0.4 + 1.3 x Weight

10 Now, if we’re using Ridge Regularization, we want to minimize this equation, SSR + λ x slope²… …and, with λ = 1, we end up with 0 + 1 x 1.3² = 1.69 as the score for the line that fits the Training Data perfectly.
Ridge/Squared/L2 Regularization:
Details Part 3
11 Now let’s calculate the Ridge Score for the new line that doesn’t fit the Training Data as well. It has an SSR = 0.4…

Height = 1.2 + 0.6 x Weight

12 …so we plug the SSR, 0.4, the value for λ (lambda), which is still 1, and the slope, 0.6, into the Ridge Penalty equation, SSR + λ x slope². And when we do the math, we get 0.4 + 1 x 0.6² = 0.76. BAM!!!

However, you may be wondering how we found the new line, and that means it’s time to talk a little bit more about λ (lambda).
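Here’s a minimal Python sketch of the Ridge Score. The Training Data are made up so that the first line fits them perfectly; with made-up data the second score won’t match the book’s 0.76 exactly, but the idea is the same:

# A minimal sketch of the Ridge Score, SSR + lambda x slope**2.
def ssr(data, intercept, slope):
    # Sum of Squared Residuals for Height = intercept + slope x Weight.
    return sum((h - (intercept + slope * w)) ** 2 for w, h in data)

def ridge_score(data, intercept, slope, lam):
    return ssr(data, intercept, slope) + lam * slope ** 2

# Made-up (Weight, Height) Training Data that the first line fits perfectly.
training = [(1.0, 1.7), (2.0, 3.0), (3.0, 4.3)]
print(ridge_score(training, intercept=0.4, slope=1.3, lam=1))  # 0 + 1 x 1.3**2 = 1.69
print(ridge_score(training, intercept=1.2, slope=0.6, lam=1))  # SSR + 1 x 0.6**2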
Ridge/Squared/L2 Regularization:
Details Part 4
15 When λ = 0, the whole Ridge Penalty is also 0…

SSR + λ x slope² = SSR + 0 x slope² = SSR + 0 = SSR

…and that means we’ll only minimize the SSR, so it’s as if we’re not using Ridge Regularization… …and, as a result, the new line, derived from Ridge Regularization, is no different from a line that minimizes the SSR.

17 When we increase λ to 2, the slope gets even smaller… …and when λ = 3, it gets smaller still. So, how do we pick a good value for λ?
Ridge/Squared/L2 Regularization:
Details Part 5
20 Unfortunately, there’s no good way to know in advance what the best value for λ will be, so we just pick a bunch of potential values, including 0 (λ = 0 vs. λ = 1 vs. λ = 2, etc.), and see how well each one performs using Cross Validation. BAM!!!

NOTE: When the model has more than one variable, like this one…

Height = intercept + slope_w x Weight + slope_s x Shoe Size + slope_a x Airspeed of a Swallow

…the Ridge Penalty squares and adds up every slope: SSR + λ x (slope_w² + slope_s² + slope_a²). BAM!!!
Ridge/Squared/L2 Regularization: FAQ

All of the examples showed how increasing λ, and thus decreasing the slope, made things better, but what if we need to increase the slope? Can Ridge Regularization ever make things worse?

As long as you try setting λ to 0 when you’re searching for the best value for λ, in theory, Ridge Regularization can never perform worse than simply finding the line that minimizes the SSR.

When we only have one slope to optimize, one way to find the line that minimizes the SSR + the Ridge Penalty is to use Gradient Descent. In this case, we want to find the optimal y-axis intercept and slope, so we take the derivative with respect to the intercept…

d/d intercept ( SSR + λ x slope² ) = Σ -2 x ( Height - (intercept + slope x Weight) )

…and the derivative with respect to the slope…

d/d slope ( SSR + λ x slope² ) = Σ -2 x Weight x ( Height - (intercept + slope x Weight) ) + 2 x λ x slope
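Here’s a minimal Python sketch of plugging those two derivatives into Gradient Descent. The Training Data, λ, and Learning Rate are all made up:

# A minimal sketch of Gradient Descent for one intercept and one slope
# with a Ridge Penalty.
training = [(1.0, 1.7), (2.0, 3.0), (3.0, 4.3)]   # made-up (Weight, Height) pairs
lam, learning_rate = 1.0, 0.01
intercept, slope = 0.0, 0.0

for _ in range(10000):
    d_intercept = sum(-2 * (h - (intercept + slope * w)) for w, h in training)
    d_slope = sum(-2 * w * (h - (intercept + slope * w)) for w, h in training) + 2 * lam * slope
    intercept -= learning_rate * d_intercept   # step size = derivative x learning rate
    slope -= learning_rate * d_slope
print(intercept, slope)   # the slope ends up smaller than the SSR-only fit, 1.3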
Lasso/Absolute Value/L1 Regularization:
Details Part 2
5 The big difference between Ridge and Lasso Regularization is that Ridge Regularization can only shrink the parameters to be asymptotically close to 0. In contrast, Lasso Regularization can shrink parameters all the way to 0. For example, given this equation…

Height = intercept + slope_w x Weight + slope_s x Shoe Size + slope_a x Airspeed of a Swallow

…then regardless of how useless the variable Airspeed of a Swallow is for making predictions, Ridge Regularization will never get slope_a = 0. In contrast, if Airspeed of a Swallow were totally useless, then Lasso Regularization can make slope_a = 0, resulting in a simpler model that no longer includes Airspeed of a Swallow.
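Here’s a minimal sketch of that difference using scikit-learn (an assumption on my part; the book doesn’t use any particular library) and made-up data with one useful variable and one useless one:

# A minimal sketch: Lasso can zero out a useless variable; Ridge cannot.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
weight = rng.normal(70, 10, 100)               # useful variable
useless = rng.normal(0, 1, 100)                # stands in for Airspeed of a Swallow
height = 0.5 * weight + rng.normal(0, 1, 100)
X = np.column_stack([weight, useless])

print(Ridge(alpha=10).fit(X, height).coef_)    # the useless slope is small, not 0
print(Lasso(alpha=0.5).fit(X, height).coef_)   # the useless slope is (typically) exactly 0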
NOTE: Ridge and Lasso
Regularization are frequently
combined to get the best of
both worlds.
DOUBLE BAM!!!
Ridge vs. Lasso Regularization: Details Part 1
1 The critical thing to know about Ridge vs. Lasso Regularization is that Ridge works better when most of the variables are useful and Lasso works better when a lot of the variables are useless, and Ridge and Lasso can be combined to get the best of both worlds. That said, people frequently ask why only Lasso can set parameter values (e.g., the slope) to 0 and Ridge cannot. What follows is an illustration of this difference, so if you’re interested, read on!!!

[Figure: the Training Data, Height vs. Weight, next to a graph of SSR + λ x slope² (y-axis, 5 to 15) plotted against Slope Values (x-axis, -0.2 to 1.0).]
Ridge vs. Lasso Regularization: Details Part 2
Ultimately, we can plot the SSR + λ x slope², with λ = 0, as a function of the slope, and we get this blue curve. [Figure: the blue curve of SSR + λ x slope² (y-axis, 5 to 20) vs. Slope Values (x-axis, -0.2 to 1.0).]
Ridge vs. Lasso Regularization: Details Part 3
5 Now that we have this blue curve, we can see that the value for the slope that minimizes the Ridge Score is just over 0.4… …which corresponds to this blue line.

6 When we increase λ to 10 and plot SSR + λ x slope² again, the lowest point of the new curve corresponds to a smaller slope… …which corresponds to this orange line.

Ridge Regression makes me think of being at the top of a cold mountain. Brrr!! And Lasso Regression makes me think about cowboys. Giddy up!
Ridge vs. Lasso Regularization: Details Part 4
7 When we compare the blue line, which has the optimal slope when λ = 0, to the curves we get when λ = 10 and λ = 20, we see that the larger we make λ, the closer the lowest point gets to slope = 0… …but, because each curve is still a smooth parabola, the optimal slope never lands on 0 exactly. [Figure: curves of SSR + λ x slope² for λ = 0, 10, and 20 (y-axis, 5 to 20) vs. Slope Values (x-axis, -0.2 to 1.0), each with its lowest point marked.]
Ridge vs. Lasso Regularization: Details Part 5
[Figure: the curve of SSR + λ x slope² for λ = 40; the lowest point is even closer to, but still not exactly at, slope = 0.] BAM!!!

Now let’s do the same thing using the Lasso Penalty and calculate SSR + λ x |slope|!!!
Ridge vs. Lasso Regularization: Details Part 6
13 Now when we calculate the SSR + λ x |slope| and let λ = 10 for a bunch of different values for the slope, we get this orange shape, and the lowest point corresponds to a slope just lower than 0.4. However, unlike before, this shape is not a smooth parabola: it has a sharp kink in it.

14 When we calculate the SSR + λ x |slope| with larger and larger values for λ, the kink rises until the lowest point in the shape sits at slope = 0 exactly… …and when the optimal slope is 0, the variable drops out of the model. BAM!!!
Ridge vs. Lasso Regularization: Details Part 7
16 In summary, when we increase the Ridge, Squared, or L2 Penalty, the optimal value for the slope shifts toward 0, but we retain a nice parabola shape, and the optimal slope is never 0 itself. [Figure: SSR + λ x slope² for λ = 0, 10, 20, 40, and 400.]

17 In contrast, when we increase the Lasso, Absolute Value, or L1 Penalty, the shape develops a sharp kink, and the optimal slope can end up at exactly 0. [Figure: SSR + λ x |slope| for λ = 0, 10, 20, and 40.] TRIPLE BAM!!!
Chapter 10
Decision Trees!!!
Classification and Regression Trees:
Main Ideas
Classification Trees: Main Ideas
1 The Problem: We have a mixture of discrete and continuous data…

2 A Solution: A Classification Tree, which can handle all types of data, all types of relationships among the independent variables (the data we’re using to make predictions, like Loves Soda and Age), and all kinds of relationships with the dependent variable (the thing we want to predict, which in this case is Loves Troll 2).
Building a Classification Tree: Step-by-Step
1 Given this Training Dataset, we want to build a Classification Tree that uses Loves Popcorn, Loves Soda, and Age to predict whether or not someone will love Troll 2.

Loves Popcorn | Loves Soda | Age | Loves Troll 2
Yes | Yes | 7 | No
Yes | No | 12 | No
No | Yes | 18 | Yes
… | … | … | …

We start by making a super simple tree that only asks if someone loves Popcorn and running the data down it: the Leaf for people who love Popcorn ends up with 1 Yes and 3 No for Loves Troll 2, and the Leaf for people who do not ends up with 2 Yes and 1 No. BAM!!!

8 Then we do the same thing for Loves Soda.
Building a Classification Tree: Step-by-Step
…when we run the Training Data down both simple trees, we see that three of the Leaves (with Loves Troll 2 Yes/No counts of 1/3, 2/1, and 3/1) contain mixtures of people who love and do not love Troll 2. In contrast, this last Leaf (0/3) only contains people who do not love Troll 2.

TERMINOLOGY ALERT!!! Leaves that contain mixtures of classifications are called Impure.

10 Because both Leaves in the Loves Popcorn tree are Impure and only one Leaf in the Loves Soda tree is Impure… …it seems like Loves Soda does a better job classifying who loves and does not love Troll 2, but it would be nice if we could quantify the differences between Loves Popcorn and Loves Soda.
Building a Classification Tree: Step-by-Step
13 Gini Impurity for a Leaf = 1 - (the probability of “yes”)² - (the probability of “no”)²

14 For the Leaf on the left, when we plug the numbers, Yes = 1, No = 3, and Total = 1 + 3, into the equation for Gini Impurity, we get 0.375:

Gini Impurity = 1 - ( The number for Yes / The total for the Leaf )² - ( The number for No / The total for the Leaf )²
= 1 - ( 1 / (1 + 3) )² - ( 3 / (1 + 3) )² = 0.375

Likewise, for the Leaf on the right, where Yes = 2 and No = 1:

Gini Impurity = 1 - ( 2 / (2 + 1) )² - ( 1 / (2 + 1) )² = 0.444
Building a Classification Tree: Step-by-Step
17 Total Gini Impurity = weighted average of the Gini Impurities for the Leaves

Total Gini Impurity = ( 4 / (4 + 3) ) x 0.375 + ( 3 / (4 + 3) ) x 0.444 = 0.405
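Here are the same Gini Impurity calculations as a minimal Python sketch:

# A minimal sketch of Gini Impurity for two Leaves and their weighted average.
def gini(yes, no):
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

left, right = gini(1, 3), gini(2, 1)           # 0.375 and 0.444
n_left, n_right = 4, 3                         # people in each Leaf
total_gini = (n_left * left + n_right * right) / (n_left + n_right)
print(round(left, 3), round(right, 3), round(total_gini, 3))   # 0.375 0.444 0.405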
Building a Classification Tree: Step-by-Step
19 Now that we’ve calculated the Gini Impurity for Loves Popcorn, 0.405…

20 The lower Gini Impurity for Loves Soda, 0.214, confirms what we suspected earlier: Loves Soda does a better job classifying people who love and do not love Troll 2. However, now that we’ve quantified the difference, we no longer have to rely on intuition. Bam!

23 Lastly, we calculate the Gini Impurity values for each average Age. For example, the first average Age is 9.5, so we use 9.5 as the threshold for splitting the rows into 2 Leaves…
Building a Classification Tree: Step-by-Step
25 Now remember: our first goal was to determine whether we should ask about Loves Popcorn, Loves Soda, or Age at the very top of the tree… …and because Loves Soda gave us the lowest Gini Impurity, it goes at the top. BAM!!!
Building a Classification Tree: Step-by-Step
28 With Loves Soda at the top of the tree, the 4 people who love Soda, including 3 who love Troll 2 and 1 who does not, go to the Node on the left… 29 …and the 3 people who do not love Soda, all of whom do not love Troll 2, go to the Node on the right. 30 Now, because the Node on the left is Impure, we need to split it again.
Building a Classification Tree: Step-by-Step
32 Now remember, earlier we put Loves Soda in the Root because splitting every person in the Training Data based on whether or not they love Soda gave us the lowest Gini Impurity. So, the 4 people who love Soda went to the left Node… 33 …and the 3 people who do not love Soda, all of whom do not love Troll 2, went to the Node on the right.

Loves Popcorn | Loves Soda | Age | Loves Troll 2
Yes | Yes | 7 | No
Yes | No | 12 | No
No | Yes | 18 | Yes
No | Yes | 35 | Yes
Yes | Yes | 38 | Yes
Yes | No | 50 | No
No | No | 83 | No

34 Then, using just the 4 people who love Soda, we tried splitting on Loves Popcorn, which left both new Leaves Impure (Yes/No counts of 1/1 and 2/0, for a Gini Impurity of 0.25)…

35 …and because splitting those same 4 people on Age < 12.5 resulted in the lowest Gini Impurity, 0, we add it to the tree. And the new Nodes are Leaves because neither is Impure: the Age < 12.5 Leaf has counts 0/1, the Age ≥ 12.5 Leaf has counts 3/0, and the Leaf for people who do not love Soda has counts 0/3.

BAM? Not yet, there are still a few things we need to talk about.
Building a Classification Tree: Step-by-Step
39 When we built this tree, only one person in the Training Data made it to this Leaf… …and because so few people in the Training Data made it to that Leaf, it’s hard to have confidence that the tree will do a great job making predictions with future data. However, in practice, there are two main ways to deal with this type of problem: one is called Pruning, and the other is to put limits on how trees grow, for example, by requiring a minimum number of people per Leaf. To pick that minimum, we try a bunch, use Cross Validation, and pick the number that works best. BAM!!!

Now let’s summarize how to build a Classification Tree.
Building a Classification Tree: Summary
1 From the entire Training Dataset, we used Gini Impurity to select Loves Soda for the Root of the tree. As a result, the 4 people who love Soda went to the left and the 3 people who do not love Soda went to the right. BAM!

2 Then we used the 4 people who love Soda, which were a mixture of people who do and do not love Troll 2, to calculate Gini Impurities and selected Age < 12.5 for the next Node. Double BAM!!

3 Then we selected output values for each Leaf by picking the categories with the highest counts: people who do not love Soda get Does Not Love Troll 2, people who love Soda and are younger than 12.5 get Does Not Love Troll 2, and people who love Soda and are 12.5 or older get Loves Troll 2. TRIPLE BAM!!

Regression Trees
Regression Trees: Main Ideas Part 1
1 The Problem: We have this Training Dataset that consists of Drug Effectiveness for different Doses… …and fitting a straight line to the data would result in terrible predictions because there are clusters of ineffective Doses that surround the effective Doses. [Figure: Drug Effectiveness % (y-axis, 0 to 100) vs. Drug Dose (x-axis, 10 to 40).]

2 A Solution: Just like Classification Trees, Regression Trees are relatively easy to interpret and use. In this example, if you’re given a new Dose and want to know how Effective it will be, you start at the top and ask if the Dose is < 14.5… …if it is, then the drug will be 4.2% Effective. If the Dose is not < 14.5, then ask if the Dose ≥ 29… …if so, then the drug will be 2.5% Effective. If not, ask if the Dose ≥ 23.5… …if so, then the drug will be 52.8% Effective. If not, the drug will be 100% Effective… BAM!!!
Regression Trees: Main Ideas Part 2
3 In this example, the Regression Tree makes good predictions because each Leaf corresponds to a different cluster of points in the graph. If we have a Dose < 14.5, then the output from the Regression Tree is the average Effectiveness from these 6 measurements, which is 4.2%.

4 If the Dose is ≥ 29, then the output is the average Effectiveness from these 4 measurements, 2.5%.

5 If the Dose is between 23.5 and 29, then the output is the average Effectiveness from these 5 measurements, 52.8%. 6 And if the Dose is between 14.5 and 23.5, then the output is the average Effectiveness from these 4 measurements, 100%.
Building a Regression Tree: Step-by-Step
1 Given these Training Data, we want to build a Regression Tree that uses Drug Dose to predict Drug Effectiveness. 2 Just like for Classification Trees, the first thing we do for a Regression Tree is decide what goes in the Root. To start, we use the average of the first two Doses, 3, as a cutoff and build a super simple tree that asks if Dose < 3.
Building a Regression Tree: Step-by-Step
5 For the one point with Dose < 3, which has Effectiveness = 0… …the Regression Tree (Dose < 3, with Average = 0 on the left and Average = 38.8 on the right) makes a pretty good prediction, 0.

6 In contrast, for this specific point, which has Dose > 3 and 100% Effectiveness, the prediction, 38.8, is pretty bad… …and when we add up the Squared Residuals for all of the points, we get SSR = 27,468.5 for this first threshold.
Building a Regression Tree: Step-by-Step
8 Now we shift the Dose threshold to be the average of the second and third measurements in the graph, 5… …and we build this super simple tree with Dose < 5 at the Root (Average = 0 on the left, Average = 41.1 on the right). 9 Then we calculate the SSR for the new threshold and plot it on a graph of SSR (y-axis) vs. Dose threshold (x-axis).
Building a Regression Tree: Step-by-Step
12 Then we just keep shifting the threshold to the average of every pair of consecutive Doses, create the tree (for example, Average = 0 vs. Average = 47, then Average = 4 vs. Average = 52), then calculate and plot the SSR.

13 After shifting the Dose threshold over 2 more steps, the SSR graph looks like this. 14 Then, after shifting the Dose threshold 7 more times (for example, Dose < 26, with Average = 46 and Average = 17), the SSR graph looks like this.

15 And finally, after shifting the Dose threshold all the way to the last pair of Doses (Dose < 36, with Average = 39 and Average = 0)… …the SSR graph looks like this. BAM!
Building a Regression Tree: Step-by-Step
16 Looking at the SSRs for each Dose threshold, we see that Dose < 14.5 had the smallest SSR… …so Dose < 14.5 will be the Root of the tree…

17 …and, in theory, we could subdivide these 6 measurements with Dose < 14.5 into smaller groups by repeating what we just did, only this time focusing only on these 6 measurements.

18 In other words, just like before, we can average the first two Doses and use that value, 3, as a cutoff for splitting the 6 measurements with Dose < 14.5 into 2 groups… …then we calculate the SSR for just those 6 measurements and plot it on a graph.
Building a Regression Tree: Step-by-Step
19 And after calculating the SSR for each threshold for the 6 measurements with Dose < 14.5, we end up with this graph… …and then we select the threshold that gives us the lowest SSR, Dose < 11.5, for the next Node in the tree (Average = 1 on the left, Average = 20 on the right).

BAM? No. No bam.
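Here’s a minimal Python sketch of the threshold search we just did: try the average of every pair of consecutive Doses, predict each side’s average, and keep the threshold with the smallest SSR. The data are made up:

# A minimal sketch of picking a Regression Tree threshold by minimizing the SSR.
doses = [3, 5, 8, 12, 16, 20, 25, 28, 32, 36]
effectiveness = [0, 0, 0, 5, 20, 100, 100, 55, 5, 0]

def ssr_for_threshold(threshold):
    left = [e for d, e in zip(doses, effectiveness) if d < threshold]
    right = [e for d, e in zip(doses, effectiveness) if d >= threshold]
    total = 0.0
    for group in (left, right):
        avg = sum(group) / len(group)              # each Leaf predicts its average
        total += sum((e - avg) ** 2 for e in group)
    return total

candidates = [(a + b) / 2 for a, b in zip(doses, doses[1:])]
best = min(candidates, key=ssr_for_threshold)
print(best, ssr_for_threshold(best))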
Building a Regression Tree: Step-by-Step
21 Now, because there are only 6 measurements with Dose < 14.5, there are only 6 measurements in the Node on the left… …and because we require a minimum of 7 measurements for further subdivision, the Node on the left will be a Leaf… …and the output value for the Leaf is the average Effectiveness from the 6 measurements, 4.2%. BAM!!!

22 Now we need to figure out what to do with the measurements on the other side, with Dose ≥ 14.5.
Building a Regression Tree: Step-by-Step
24 And because there are only 4 measurements with Dose ≥ 29, there are only 4 measurements in the Node on the left… …so it becomes a Leaf, and its output is the average Effectiveness from those 4 measurements, 2.5%.

25 Then we use the remaining measurements, with Doses between 14.5 and 29, to pick the next threshold, Dose ≥ 23.5…
Building a Regression Tree: Step-by-Step
26 And since there are fewer than 7 measurements in each of these two groups… …this will be the last split, because none of the Leaves has more than 7 measurements in them. Now, all we have to do is calculate the output values for the last 2 Leaves.

[Figure: the tree so far: Dose < 14.5 → 4.2% Effective; otherwise Dose ≥ 29 → 2.5% Effective; otherwise Dose ≥ 23.5 splits the remaining measurements into the last 2 Leaves.]
Building a Regression Tree
With Multiple Features: Part 1

NOTE: Just like for Classification Trees, Regression Trees can use any type of variable to make a prediction. However, with Regression Trees, we always try to predict a continuous value.

1 So far, we’ve built a Regression Tree using a single predictor, Dose, to predict Drug Effectiveness…

Dose | Drug Effect
10 | 98
20 | 0
35 | 6
5 | 44
etc… | etc…

2 …but now the Training Data also include Age and Sex:

Dose | Age | Sex | Drug Effect
10 | 25 | F | 98
20 | 73 | M | 0
35 | 54 | F | 6
5 | 12 | M | 44
etc… | etc… | etc… | etc…

3 Just like before, we try different Dose thresholds… …and then we select the threshold that gives us the smallest SSR. However, instead of that threshold instantly becoming the Root, it only becomes a candidate for the Root.

4 Then we do the same thing with Age thresholds (for example, Age > 50, with Average = 3 and Average = 52), and the Age threshold with the smallest SSR becomes the second candidate for the Root.

5 Lastly, we ignore Dose and Age and only use Sex to predict Effectiveness… …and even though Sex only has one threshold for splitting the data (Sex = F, with Average = 52 and Average = 40), we still calculate the SSR, just like before… …and that becomes the third candidate for the Root.
Building a Regression Tree
With Multiple Features: Part 3

7 Then we compare the SSRs for the three candidate Roots (for example, Sex = F has SSR = 20,738) and put whichever candidate gives us the lowest SSR at the top of the tree.

8 Then we grow the tree just like before, except now for each split we have 3 candidates, Dose, Age, and Sex, and we select whichever gives us the lowest SSR…
Chapter 11
Support Vector
Classifiers and
Machines
(SVMs)!!!
Support Vector Machines (SVMs): Main Ideas
The Problem: We measured how much Popcorn people ate,
1 in grams, and we want to use that information to predict if
someone will love Troll 2. The red dots represent people who
do not love Troll 2. The blue dots represent people who do.
Support Vector Classifiers: Details Part 1
1 If all the people who do not love Troll 2 didn’t eat much Popcorn… …and all of the people who love Troll 2 ate a lot of Popcorn… …then we could pick this threshold on the Popcorn (g) axis to classify people.

2 However, as you’ve probably already guessed, this threshold is terrible because if someone new comes along and they ate this much Popcorn… …then the threshold will classify them as someone who loves Troll 2, even though they’re much closer to the people who do not love Troll 2. So, this threshold is really bad. Can we do better? YES!!!

3 One way to improve the classifier is to use the midpoint between the edges of each group of people. Now the person we’d previously classified as someone who loves Troll 2 is classified as someone who does not love Troll 2, which makes more sense. Can we do better? YES!!!
Support Vector Classifiers: Details Part 2
5 One way to make a threshold for classification less sensitive to outliers is to allow misclassifications.

NOTE: If we measured 2 things, like Popcorn (g) and Soda (ml), the data would be 2-Dimensional… …and a Support Vector Classifier would be a straight line. In other words, the Support Vector Classifier would be 1-Dimensional. And if we measured 4 things, then the data would be 4-Dimensional, which we can’t draw, but the Support Vector Classifier would be 3-Dimensional. Etc., etc., etc.
Support Vector Classifiers: Details Part 4
Double Ugh!! These and these are misclassified!!!

Triple Ugh!!! These and these are misclassified!!!

Can we do better???
Terminology Alert!!! Oh no! It’s the dreaded Terminology Alert!!!
1 The name Support Vector Classifier comes from the fact that the points that define the threshold (and any points closer to the threshold) are called Support Vectors.

2 When no single threshold can separate the two groups along the Popcorn Consumed (g) axis, we can add a y-axis and move the data into 2 dimensions. 3 Specifically, we’ll make the y-axis the square of the amount of Popcorn each person ate, Popcorn². 4 For example, if someone only ate 0.5 grams of Popcorn, their x-axis coordinate is 0.5 and their y-axis coordinate is 0.5² = 0.25…

In other words, how do we decide how to transform the data? To make the mathematics efficient, Support Vector Machines use something called Kernel Functions that we can use to systematically find Support Vector Classifiers in higher dimensions. Two of the most popular Kernel Functions are the Polynomial Kernel and the Radial Kernel, also known as the Radial Basis Function.
3 In the Popcorn example, we set r = 1/2 and d = 2… …and since we’re squaring the term, we can expand it to be the product of two terms…

(a x b + r)^d = (a x b + 1/2)² = (a x b + 1/2)(a x b + 1/2) = a²b² + (1/2)ab + (1/2)ab + 1/4

…and then combine the middle terms…

= a²b² + ab + 1/4

…and, just because it will make things look better later, let’s flip the order of the first two terms…

= ab + a²b² + 1/4

…and finally, this polynomial is equal to this Dot Product!

ab + a²b² + 1/4 = (a, a², 1/2) · (b, b², 1/2)

NOTE: Dot Products sound fancier than they are. A Dot Product is just… …the first terms multiplied together… …plus the second terms multiplied together… …plus the third terms multiplied together:

(a, a², 1/2) · (b, b², 1/2) = (a x b) + (a² x b²) + (1/2 x 1/2)
…the first terms in the Dot Product are the points’ x-axis coordinates, the second terms are their y-axis coordinates, 7 …and the third terms are their z-axis coordinates. However, since they are 1/2 for both points (and every pair of points we select), we can ignore them. BAM!!!

(a x b + r)^d = (a x b + 1/2)² = (a, a², 1/2) · (b, b², 1/2)

…so, instead of actually transforming the data, we just plug a pair of Popcorn values into the kernel… …for example, we plug the values, 5 and 10, into the Dot Product… …do the math… …and we use this number, 2550.25, instead of the high-dimensional distance to find the optimal Support Vector Classifier.

Umm…How on earth does this work?
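It works because the kernel and the Dot Product really are the same number, which we can check with a minimal Python sketch:

# A minimal sketch checking the Polynomial Kernel math above.
def poly_kernel(a, b, r=0.5, d=2):
    return (a * b + r) ** d

def dot_product(a, b):
    # (a, a^2, 1/2) . (b, b^2, 1/2)
    return a * b + a**2 * b**2 + 0.5 * 0.5

print(poly_kernel(5, 10))    # (5 x 10 + 1/2)**2 = 2550.25
print(dot_product(5, 10))    # 50 + 2500 + 0.25  = 2550.25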
2 The basic idea behind the Radial Kernel is that when we want to classify a new person… …we simply look to see how the closest points from the Training Data are classified. In this case, the closest points represent people who do not love Troll 2… DOUBLE BAM!!!

Believe it or not, the Radial Kernel finds Support Vector Classifiers in infinite dimensions!
Chapter 12
Neural
Networks!!!
Neural Networks
Part One:
Understanding How
Neural Networks Work
Neural Networks: Main Ideas
1 The Problem: We have data that show the Effectiveness of a drug for different Doses… …and the s-shaped squiggle that Logistic Regression uses would not make a good fit. In this example, it would misclassify the high Doses. [Figure: Drug Effectiveness % (0 to 100) vs. Drug Dose (Low, Medium, High): Effectiveness is low at Low and High Doses and high at Medium Doses.]

2 A Solution: A Neural Network can fit a squiggle to these data that is low for Low and High Doses and high for Medium Doses. BAM!!!
Terminology Alert!!! Anatomy of a Neural Network (Oh no! It’s the dreaded Terminology Alert!!!)

[Figure: the Neural Network diagram: Nodes connected by arrows, where each connection multiplies by a Weight (e.g., x 2.28) or adds a Bias (e.g., + 1.29).]
Terminology Alert!!! Layers (Oh no! A Double Terminology Alert!!!)

[Figure: the same Neural Network with its Input Layer, Hidden Layer, and Output Layer labeled.]
Terminology Alert!!! Activation Functions (Oh no! A TRIPLE TERMINOLOGY ALERT!!!)

[Figure: the Neural Network that converts Dose (Input) into Effectiveness (Output). The connections carry Weights (x -34.4, x -2.52, x -1.30, x 2.28) and Biases (+ 2.14, + 1.29, + -0.58), the Hidden Layer Nodes contain curved Activation Functions, and a Sum combines the two Hidden Layer outputs.]
A Neural Network in Action: Step-by-Step
1 A lot of people say that Neural Networks are black boxes and that it’s difficult to understand what they’re doing. Unfortunately, this is true for big, fancy Neural Networks. However, the good news is that it’s not true for simple ones. So, let’s walk through how this simple Neural Network works, one step at a time, by plugging in Dose values from low to high and seeing how it converts the Dose (input) into predicted Effectiveness (output).

[Figure: the Neural Network, Dose (Input) to Effectiveness (Output), with Weights x -34.4, x -1.30, x -2.52, x 2.28 and Biases + 2.14, + 1.29, + -0.58; next to it, the Training Data with Drug Dose on a 0-to-1 scale.]
A Neural Network in Action: Step-by-Step
4 The first thing we do is plug the lowest Dose, 0, into the Neural Network.

5 Now, the connection from the Input Node to the top Node in the Hidden Layer multiplies the Dose by -34.4 and then adds 2.14 to give us the x-axis coordinate for the Activation Function:

(Dose x -34.4) + 2.14 = x-axis coordinate
(0 x -34.4) + 2.14 = 2.14
A Neural Network in Action: Step-by-Step
6 Then we plug the x-axis coordinate, 2.14, into the SoftPlus Activation Function… …and do the math, just like we saw earlier… …and we get 2.25. So, when the x-axis coordinate is 2.14, the corresponding y-axis value is 2.25…

SoftPlus(2.14) = log(1 + e^2.14) = 2.25

…thus, when Dose = 0, the output for the SoftPlus is 2.25. So, let’s put a blue dot, corresponding to the blue activation function, onto the graph of the original data at Dose = 0 with a y-axis value = 2.25.

7 Now let’s increase the Dose to 0.2…
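Here’s a minimal Python sketch of the forward pass we’re walking through, using the book’s Weights and Biases and the SoftPlus Activation Function:

# A minimal sketch of this Neural Network's forward pass.
from math import exp, log

def softplus(x):
    return log(1 + exp(x))

def predict(dose):
    top = softplus(dose * -34.4 + 2.14) * -1.30      # top Node, then its Weight
    bottom = softplus(dose * -2.52 + 1.29) * 2.28    # bottom Node, then its Weight
    return top + bottom + -0.58                      # Sum, then the Final Bias

print(round(softplus(2.14), 2))   # 2.25, the y-axis value when Dose = 0
print(round(predict(0.0), 2))     # about 0, matching the data at Dose = 0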
A Neural Network in Action: Step-by-Step
8 Now let’s increase the Dose to 0.4… …and calculate the new x-axis coordinate by multiplying the Dose by -34.4 and adding 2.14 to get -11.6…
A Neural Network in Action: Step-by-Step
10 Ultimately, the blue dots result in this blue curve…

12 Likewise, when we multiply all of the y-axis coordinates on the blue curve by -1.30… …we end up flipping and stretching the original blue curve to get a new blue curve. For example, when we multiply 2.25 by -1.30, we get a new y-axis coordinate, -2.93.
A Neural Network in Action: Step-by-Step
13 First, we plugged Dose values, ranging from 0 to 1, into the input… …and, after running them through the top Node in the Hidden Layer, we ended up with the final blue curve. BAM!!!
A Neural Network in Action: Step-by-Step
14 Now we run Dose values through the connection to the bottom Node in the Hidden Layer.

15 The good news is that the only difference between what we just did and what we’re doing now is that now we multiply the Doses by -2.52 and add 1.29… …and then we stretch the y-axis coordinates by multiplying them by 2.28 to get this orange curve.
A Neural Network in Action: Step-by-Step
16 So, now that we have an orange curve… …and a blue curve… …the next step in the Neural Network is to add their y-axis coordinates together.
A Neural Network in Action: Step-by-Step
19 Now that we have this green squiggle from the blue and orange curves… …we’re ready for the final step, which is to add -0.58 to each y-axis value on the green squiggle.

21 Likewise, subtracting 0.58 from all of the other y-axis coordinates on the green squiggle shifts it down… …and, ultimately, we end up with a green squiggle that fits the Training Data. DOUBLE BAM!!!
A Neural Network in Action: Step-by-Step
22 Hooray!!! At long last, we see the final green squiggle… TRIPLE BAM!!!

Now let’s learn how to fit a Neural Network to data!
Neural Networks
Part Deux:
Fitting a Neural
Network to Data with
Backpropagation
Backpropagation: Main Ideas
1 The Problem: Just like for Linear Regression, Neural Networks have parameters that we need to optimize to fit a squiggle or bent shape to data. How do we find the optimal values for these parameters?

2 A Solution: we use Backpropagation, which combines The Chain Rule with Gradient Descent, to turn the question marks into optimal parameter values. BAM!!!

[Figure: the same Neural Network twice: first with every Weight and Bias shown as ???, then with the optimized values (+ 2.14, x -1.30, x -34.4, + -0.58, x -2.52, x 2.28, + 1.29).]
Terminology Alert!!! Weights and Biases
OH NO! It’s the dreaded Terminology Alert!!!

1 In Neural Networks, the parameters that we multiply by are called Weights… …and the parameters we add are called Biases.

[Figure: the Neural Network with its Weights (x -34.4, x -1.30, x -2.52, x 2.28) and Biases (+ 2.14, + 1.29, + -0.58) labeled.]
Backpropagation: Details Part 1
1 In this example, let’s assume that we already have the optimal values for all of the Weights and Biases… …except for the Final Bias. So the goal is to use Backpropagation to optimize this Final Bias.

2 Just like before, we ran Doses through the Neural Network… …and when we added the y-axis coordinates from the blue and orange curves, we got this green squiggle.
Backpropagation: Details Part 2
3 Now that the Neural Network, with the Final Bias set to a starting value of 0.0, has created the green squiggle… …we can see that the green squiggle doesn’t fit the Training Data very well.
Backpropagation: Details Part 3
5 Now, just like we did for R², Linear Regression, and Regression Trees, we can quantify how well the green squiggle fits all of the Training Data by calculating the Sum of the Squared Residuals (SSR).

SSR = Σ from i = 1 to n of (Observed_i - Predicted_i)²

For the 3 Doses in the Training Data, the green squiggle predicts 0.57, 1.61, and 0.58, so:

SSR = (0 - 0.57)² + (1 - 1.61)² + (0 - 0.58)² = 1.0
Backpropagation: Details Part 4
10 Now we can compare the SSR for different values for the Final Bias by plotting them on a graph that has values for the Final Bias on the x-axis and the corresponding SSR on the y-axis. Thus, when the Final Bias = 0, we can draw a pink dot at SSR = 1.

11 If we set the Final Bias to -0.25… …then we shift the green squiggle down a little bit… …and we can calculate the SSR and plot the value on our graph.
Backpropagation: Details Part 5
12 Likewise, if we set the Final Bias to -0.5, we can calculate the SSR and add another point to the graph… …and if we kept plugging in lots of different values for the Final Bias, we would trace out a pink curve of SSR values.
Backpropagation: Details Part 6
15 Remember, each Predicted value in the SSR comes from the green squiggle…

SSR = Σ (Observed_i - Predicted_i)²

…and the green squiggle depends on the Final Bias… …so, rather than plugging in values one at a time, we can use The Chain Rule to solve for the derivative of the SSR with respect to the Final Bias, d SSR / d Final Bias.
Backpropagation: Details Part 7
18 The Chain Rule says that the derivative of the SSR with respect to the Final Bias… …is the derivative of the SSR with respect to the Predicted values… …multiplied by the derivative of the Predicted values with respect to the Final Bias:

d SSR / d Final Bias = (d SSR / d Predicted) x (d Predicted / d Final Bias)

BAM!!!

Psst! If this doesn’t make any sense and you need help with The Chain Rule, see Appendix F. The Chain Rule is worth learning about because it shows up all over the place in machine learning. Specifically, anything that uses Gradient Descent has a good chance of involving The Chain Rule.
Backpropagation: Details Part 8
19 First, we solve for the derivative of the SSR with respect to the Predicted values…

d SSR / d Predicted = d/d Predicted Σ (Observed_i - Predicted_i)²

…by moving the square to the front… …and multiplying everything by the derivative of the stuff inside the parentheses, which is -1…

= Σ 2 x (Observed_i - Predicted_i) x -1 = Σ -2 x (Observed_i - Predicted_i)

BAM!!! We solved for the first part of the derivative. Now let’s solve for the second part.
Backpropagation: Details Part 9
21 The second part, the derivative of the Predicted values with respect to the Final Bias… …is the derivative of the green squiggle with respect to the Final Bias…

d Predicted / d Final Bias = d/d Final Bias (blue curve + orange curve + Final Bias)

…and the derivatives of the blue and orange curves with respect to the Final Bias are both 0 because they do not depend on the Final Bias…

d/d Final Bias (blue curve + orange curve + Final Bias) = 0 + 0 + 1 = 1

24 So, putting both parts together:

d SSR / d Final Bias = Σ -2 x (Observed_i - Predicted_i) x 1

At long last, we have the derivative of the SSR with respect to the Final Bias!!! TRIPLE BAM!!!
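The derivative is a one-liner in code. Here’s a minimal Python sketch using the Predicted values from the green squiggle when the Final Bias = 0:

# A minimal sketch of d SSR / d Final Bias.
def d_ssr_d_final_bias(observed, predicted):
    return sum(-2 * (obs - pred) for obs, pred in zip(observed, predicted))

observed = [0, 1, 0]              # the 3 Doses' observed Effectiveness
predicted = [0.57, 1.61, 0.58]    # from the green squiggle with Final Bias = 0
print(d_ssr_d_final_bias(observed, predicted))   # 3.52, i.e., about 3.5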
25 In the next section, we’ll plug the derivative into Gradient Descent and solve for the optimal value for the Final Bias. [Figure: the pink SSR curve (SSR vs. Final Bias, -0.5 to 0.5) with tangent lines whose slopes come from the derivative.]
Backpropagation: Step-by-Step
1 Now that we have the derivative of the SSR with respect to the Final Bias… …which tells us how the SSR changes when we change the Final Bias… …we can optimize the Final Bias with Gradient Descent.

2 First, we expand the summation, one term per point in the Training Data, and plug in the Observed values, 0, 1, and 0:

d SSR / d Final Bias = -2 x ( 0 - Predicted_1 ) x 1
+ -2 x ( 1 - Predicted_2 ) x 1
+ -2 x ( 0 - Predicted_3 ) x 1
Backpropagation: Step-by-Step
3 Then we initialize the Final Bias to a random value. In this case, we’ll set the Final Bias to 0.0… …and with the Final Bias = 0.0, the green squiggle predicts 0.57, 1.61, and 0.58 for the three Doses, so:

d SSR / d Final Bias = -2 x ( 0 - 0.57 ) x 1 + -2 x ( 1 - 1.61 ) x 1 + -2 x ( 0 - 0.58 ) x 1
Backpropagation: Step-by-Step
5 Now we evaluate the derivative at the current value for the Final Bias, which, at this point, is 0.0. When we do the math, we get 3.5… …thus, when the Final Bias = 0, the slope of the tangent line on the SSR graph is 3.5.

d SSR / d Final Bias = -2 x ( 0 - 0.57 ) x 1 + -2 x ( 1 - 1.61 ) x 1 + -2 x ( 0 - 0.58 ) x 1 = 3.5

6 Then we calculate the Step Size with the standard equation for Gradient Descent and get 0.35. NOTE: In this example, we’ve set the Learning Rate to 0.1.

Step Size = Derivative x Learning Rate = 3.5 x 0.1 = 0.35

Gentle Reminder: The magnitude of the derivative is proportional to how big of a step we should take toward the minimum. The sign (+/-) tells us in what direction.
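Here’s a minimal Python sketch of the whole Gradient Descent loop for the Final Bias. The three fixed squiggle values are the book’s; everything else follows the steps above:

# A minimal sketch of Gradient Descent for the Final Bias.
observed = [0, 1, 0]
squiggle = [0.57, 1.61, 0.58]     # the green squiggle when the Final Bias = 0
final_bias, learning_rate = 0.0, 0.1

for _ in range(100):
    predicted = [s + final_bias for s in squiggle]
    derivative = sum(-2 * (obs - pred) for obs, pred in zip(observed, predicted))
    step_size = derivative * learning_rate    # first pass: 3.52 x 0.1 = 0.352
    final_bias -= step_size                   # first new value: about -0.35
    if abs(step_size) < 0.001:                # stop when the steps get tiny
        break
print(round(final_bias, 2))   # about -0.59, the book's -0.58 up to rounding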
Backpropagation: Step-by-Step
7 Lastly, we calculate the new value for the Final Bias: new Final Bias = current Final Bias - Step Size = 0.0 - 0.35 = -0.35. BAM!!!

8 Then we just repeat, evaluating the derivative, calculating the Step Size, and updating the Final Bias, until the Step Size is close to 0… …and we end up with the optimal Final Bias, -0.58.
Neural Networks: FAQ
And that means the green squiggle can do whatever it wants in between the original 3 Doses, including making a strange bump that may or may not make good predictions. This is something I think about when people talk about using Neural Networks to drive cars. The Neural Network probably fits the Training Data really well, but there’s no telling what it’s doing between points, and that means it will be hard to predict what a self-driving car will do in new situations.

When we have Neural Networks, which are cool and super flexible, why would we ever want to use Logistic Regression, which is a lot less flexible?

Neural Networks are cool, but deciding how many Hidden Layers to use and how many Nodes to put in each Hidden Layer and even picking the best Activation Function is a bit of an art form. In contrast, creating a model with Logistic Regression is a science, and there’s no guesswork involved. This difference means that it can sometimes be easier to get Logistic Regression to make good predictions than a Neural Network, which might require a lot of tweaking before it performs well.
Congratulations!!!
TRIPLE BAM!!!
We made it to the end of
an exciting book about
machine learning!!!
Appendices!!!
Pie Probabilities
Appendix A: Pie Probabilities
1 We’re in StatLand, where 70% of the people prefer pumpkin pie and 30% of the people prefer blueberry pie. In other words, 7 out of 10 people in StatLand, or 7/10, prefer pumpkin pie, and the remaining 3 out of 10 people, or 3/10, prefer blueberry pie.

3 Of the 7/10ths of the people who say they prefer pumpkin pie… [Figure: a probability tree whose branches are labeled with 7/10 for pumpkin pie and 3/10 for blueberry pie.] BAM!!!
Appendix A: Pie Probabilities
4 Now let’s talk about what happens when we ask a third person which pie they prefer. Specifically, we’ll talk about the probability that the first two people prefer pumpkin pie and the third person prefers blueberry. The probability that the first two people prefer pumpkin pie is 0.7 x 0.7 = 0.49… …and the probability that the first two prefer pumpkin and the third prefers blueberry is 0.7 x 0.7 x 0.3 = 0.147. DOUBLE BAM!!!
Appendix A: Pie Probabilities
Likewise, the probability that the first person prefers blueberry and the second prefers pumpkin is 0.3 x 0.7 = 0.21, and the probability that only the first of three prefers blueberry is 0.3 x 0.7 x 0.7 = 0.147… …and, starting from 0.7 x 0.3 = 0.21, the probability that only the second of three prefers blueberry is 0.7 x 0.3 x 0.7 = 0.147. In other words, every ordering with two pumpkin fans and one blueberry fan has the same probability, 0.147. TRIPLE BAM!!!
Appendix B: The Mean, Variance,
and Standard Deviation
1 Imagine we went to all 5,132 Spend-n-Save food stores and counted the number of green apples that were for sale. We could plot the number of green apples at each store on a number line… …but because there’s a lot of overlap in the data, we can also draw a Histogram of the measurements.

Population Mean = μ = Sum of Measurements / Number of Measurements = (2 + 8 + … + 37) / 5132 = 20

4 Because the Population Mean, μ, is 20, we center the Normal Curve over 20.
Appendix B: The Mean, Variance,
and Standard Deviation
6 In other words, we want to calculate how the data are spread around the Population Mean (which, in this example, is 20).

7 The formula for calculating the Population Variance is…

Population Variance = ∑ ( x - μ )² / n

…which is a pretty fancy-looking formula, so let's go through it one piece at a time.

[Figure: the Histogram and Normal Curve centered over μ = 20.]
278
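Before moving on, here's a tiny Python sketch of the Population Mean and Population Variance formulas. The five apple counts are made up for illustration (the book's real example uses all 5,132 stores), but they're chosen so the Population Mean works out to 20, just like on these pages.

```python
# A small sketch of the Population Mean and Variance formulas, using a
# made-up, tiny "population" of apple counts for illustration.
population = [10, 15, 20, 25, 30]

# Population Mean = sum of measurements / number of measurements
mu = sum(population) / len(population)

# Population Variance = sum of (x - mu)^2 / n
population_variance = sum((x - mu) ** 2 for x in population) / len(population)

print(mu, population_variance)  # 20.0 50.0
```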
Appendix B: The Mean, Variance,
and Standard Deviation
9 Now that we know how to calculate the Population Variance… …when we do the math, we get 100. BAM? Nope, not yet.

10 Because each term in the equation for Population Variance is squared… …the units for the result, 100, are Number of Apples Squared… …and that means we can't plot the Population Variance on the graph, since the units on the x-axis, Number of apples, are not squared.

11 So instead we take the square root of the Population Variance, √100 = 10, which is called the Population Standard Deviation, and because its units match the x-axis, we can plot it on the graph. BAM!!!

[Figure: the Normal Curve centered over μ = 20, with the Standard Deviation marked along the Number of apples axis.]

However, in practice it's rare that we can measure an entire population, so we can rarely calculate the Population Mean, Variance, or Standard Deviation. Read on to learn what we do instead!
279
Appendix B: The Mean, Variance,
and Standard Deviation
14 Instead of calculating Population Parameters, we estimate them from a relatively small number of measurements.

15 Estimating the Population Mean is super easy: we just calculate the average of the measurements we collected…
280
Appendix B: The Mean, Variance,
and Standard Deviation
19 Now when we plug the data into the equation for the Estimated Variance…

Estimated Variance = ∑ ( x - x̄ )² / ( n - 1 ), where x̄ = 17.6

NOTE: If we had divided by n, instead of n - 1, we would have gotten 81.4, which is a significant underestimate of the true Population Variance, 100.

[Figure: the sample measurements on the 0-to-40 number line, centered on x̄ = 17.6.]
282
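Here's a short Python sketch of that n vs. n - 1 difference. The five measurements are made up (the book's actual sample isn't listed on this page), but they're chosen so the Estimated Mean matches the x̄ = 17.6 above; the exact variances therefore won't match the book's 81.4.

```python
# A sketch of the Estimated Mean and Variance with a made-up sample,
# chosen so the Estimated Mean is 17.6, as on this page.
measurements = [14, 15, 18, 20, 21]

n = len(measurements)
x_bar = sum(measurements) / n  # Estimated Mean = 17.6

# Dividing by n tends to underestimate the spread of the whole Population...
variance_div_n = sum((x - x_bar) ** 2 for x in measurements) / n

# ...so the Estimated Variance divides by n - 1 instead.
estimated_variance = sum((x - x_bar) ** 2 for x in measurements) / (n - 1)

print(x_bar, round(variance_div_n, 2), round(estimated_variance, 2))
# 17.6 7.44 9.3
```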
Appendix C: Computer Commands for
Calculating Probabilities with
Continuous Probability Distributions
283
Appendix C: Computer Commands for
Calculating Probabilities with
Continuous Probability Distributions
3 Likewise, the Cumulative Distribution Function tells us that the area under the curve to the left of, and including, the mean value, 155.7, is 0.5… …and that makes sense because the total area under the curve is 1, and half of the area under the curve is to the left of the mean value.
285
Appendix C: Computer Commands for
Calculating Probabilities with
Continuous Probability Distributions
Gentle Reminder about the arguments for the NORMDIST() function:
normdist( x-axis value, mean, standard deviation, use CDF )
DOUBLE BAM!!

8 In the programming language called R, we can get the same result using the pnorm() function, which is just like NORMDIST(), except we don't need to specify that we want to use a CDF.
TRIPLE BAM!!!
286
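And if Python is more your thing, here's a hedged sketch using scipy.stats.norm.cdf(), which plays the same role as R's pnorm() and NORMDIST() with use CDF set to TRUE. The mean, 155.7, comes from this appendix; the standard deviation of 6.6 is just an assumed value for illustration.

```python
from scipy.stats import norm

mean = 155.7              # the mean value from this appendix
standard_deviation = 6.6  # an assumed value, just for this sketch

# The area under the Normal Curve to the left of (and including) the mean:
print(norm.cdf(155.7, loc=mean, scale=standard_deviation))  # 0.5
```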
Appendix D: The Main Ideas of Derivatives
287
Appendix D: The Main Ideas of Derivatives
1 Imagine we collected Test Scores and Time Spent Studying from 3 people… 2 …and we fit a straight line to the data. 3 One way to understand the relationship between Test Scores and Time Spent Studying is to look at changes in Test Scores relative to changes in Time Spent Studying. In this case, we see that for every unit increase in Time, there is a two-unit increase in Test Score. In other words, we can say we go up 2 for every 1 we go over.

[Figure: Test Score vs. Time Spent Studying with the fitted line; the rise, 2, and the run, 1, are marked.]

8 Now let's stop talking about Studying and talk about Eating. This line has a slope of 3, so regardless of the value for Time Spent Eating, our Fullness goes up 3 times that amount. Thus, the Derivative, the change in Fullness relative to the change in Time, is 3.

d Fullness / d Time = 3

9 When the slope is 0, and the y-axis value never changes, regardless of the x-axis value… …then the Derivative, the change in the y-axis value relative to the change in the x-axis value, is 0.

d Height / d Occupants = 0

[Figure: Fullness vs. Time Spent Eating, a line with slope 3; and the Height of the Empire State Building vs. Number of Occupants, a flat line.]
289
Appendix D: The Main Ideas of Derivatives
11 Lastly, when we have a curved line instead of a straight line, the slope, and thus the Derivative, depends on where we are on the curve. For example, the straight line that just touches the curve at this point has a slope of 3.

[Figure: a curved line relating Likes StatQuest to Awesomeness, with a tangent line of Slope = 3 at one point.]

BAM!!!
290
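To see the main idea in code, here's a small Python sketch that approximates slopes numerically. The functions are stand-ins for the lines in this appendix, and the tiny step h is an assumption of the sketch, not something from the book.

```python
# Approximate the derivative: the change in the y-axis value relative to
# a tiny change in the x-axis value.
def slope_at(f, x, h=1e-6):
    """Approximate the slope of f at x using a tiny step h."""
    return (f(x + h) - f(x)) / h

def test_score(time):    # a straight line: up 2 for every 1 we go over
    return 2 * time

def awesomeness(likes):  # a curved line (a parabola)
    return likes ** 2

print(round(slope_at(test_score, 1.0), 3))   # 2.0 on a straight line...
print(round(slope_at(test_score, 5.0), 3))   # ...no matter where we look
print(round(slope_at(awesomeness, 1.5), 3))  # about 3 here on the curve...
print(round(slope_at(awesomeness, 3.0), 3))  # ...but about 6 over here
```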
Appendix E: The Power Rule
291
Appendix E: The Power Rule
1 In this example, we have a parabola that represents the relationship between Awesomeness and Likes StatQuest. This is the equation for the parabola:

Awesomeness = ( Likes StatQuest )²

2 We can determine the derivative of Awesomeness with respect to Likes StatQuest by taking the derivative of the equation for the parabola… …and then applying The Power Rule.

d Awesomeness / d Likes StatQuest = d ( Likes StatQuest )² / d Likes StatQuest

3 The Power Rule tells us to multiply Likes StatQuest by the power, which, in this case, is 2… …and raise Likes StatQuest to the original power, 2, minus 1…

d ( Likes StatQuest )² / d Likes StatQuest = 2 x Likes StatQuest^(2-1)
                                           = 2 x Likes StatQuest

BAM!!!

Now let's look at a fancier example!!!

[Figure: the parabola relating Likes StatQuest to Awesomeness.]
292
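If you'd like a computer to double-check The Power Rule, here's a hedged sketch using sympy, a symbolic math library for Python. The variable names are mine; the equations come from this appendix and the next page.

```python
import sympy

# Awesomeness = (Likes StatQuest)^2, so the derivative is 2 x Likes StatQuest.
likes_statquest = sympy.Symbol("likes_statquest")
awesomeness = likes_statquest ** 2
print(sympy.diff(awesomeness, likes_statquest))  # 2*likes_statquest

# The fancier example on the next page: Happy = 1 + Tasty^3.
tasty = sympy.Symbol("tasty")
happy = 1 + tasty ** 3
print(sympy.diff(happy, tasty))  # 3*tasty**2
```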
Appendix E: The Power Rule
5 Here we have a graph of how Happy people are relative to how Tasty the food is… …and this is the equation for the squiggle:

Happy = 1 + Tasty³

[Figure: the squiggle relating Tasty to Happy; the low end of the Tasty axis is labeled "Cold, greasy fries from yesterday. Yuck."]

We can determine the derivative of Happy with respect to Tasty by plugging in the equation for the squiggle…

d Happy / d Tasty = d ( 1 + Tasty³ ) / d Tasty

…and taking the derivative of each term in the equation.

d ( 1 + Tasty³ ) / d Tasty = d 1 / d Tasty + d Tasty³ / d Tasty

The derivative of the constant, 1, is 0, and The Power Rule tells us the derivative of Tasty³ is 3 x Tasty², so:

d Happy / d Tasty = 3 x Tasty²

BAM!!!

Appendix F: The Chain Rule!!!
NOTE: This appendix assumes that you're already familiar with the concept of a derivative (Appendix D) and The Power Rule (Appendix E).
294
Appendix F: The Chain Rule
1 Here we have two graphs: one that relates Weight to Height, and one that relates Height to Shoe Size. Yeah!!!

[Figure: data points and fitted straight lines for Weight vs. Height and for Height vs. Shoe Size.]
295
Appendix F: The Chain Rule
6 Now, if someone tells us that they weigh this much… …then we can predict that this is their Shoe Size… …because Weight and Shoe Size are connected by Height.

[Figure: tracing a Weight value up to the Weight vs. Height line and then across the Height vs. Shoe Size line to a predicted Shoe Size.]

…and if we change the value for Weight and make it smaller, then the value for Shoe Size changes. In this case, Shoe Size gets smaller, too.

8 Now, if we want to quantify how much Shoe Size changes when we change Weight, we need to calculate the derivative for Shoe Size with respect to Weight.

Because this straight line goes up 2 units for every 1 unit over, the slope and derivative are 2…

d Height / d Weight = 2

…and that means the equation for Height is:

Height = ( d Height / d Weight ) x Weight = 2 x Weight

…and, in turn, the equation for Shoe Size is:

Shoe Size = ( d Size / d Height ) x ( d Height / d Weight ) x Weight

BAM!!!
297
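Here's a tiny numeric sketch of that chain. The d Height / d Weight = 2 comes from this page, but the 0.25 for d Shoe Size / d Height is a made-up placeholder, since that slope isn't shown here.

```python
# A minimal sketch of the Weight -> Height -> Shoe Size chain.
d_height_d_weight = 2   # from this example: Height goes up 2 per unit of Weight
d_size_d_height = 0.25  # hypothetical slope, just for illustration

# The Chain Rule: d Size / d Weight = (d Size / d Height) x (d Height / d Weight)
d_size_d_weight = d_size_d_height * d_height_d_weight
print(d_size_d_weight)  # 0.5 Shoe Size units per unit of Weight
```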
Appendix F: The Chain Rule,
A More Complicated Example
1 Now imagine we measured how Hungry a bunch of people were and how long it had been since they last had a snack. 2 So, we fit a quadratic line with an intercept of 0.5 to the measurements to reflect the increasing rate of Hunger:

Hunger = Time² + 0.5

[Figure: Hunger vs. Time Since Last Snack with the quadratic line, and Craves Ice Cream vs. Hunger with the square root curve.]

Craves Ice Cream = ( Hunger )^(1/2)

The derivative of Craves Ice Cream with respect to Time… …is the derivative of Craves Ice Cream with respect to Hunger… …multiplied by the derivative of Hunger with respect to Time.

d Hunger / d Time = 2 x Time

d Craves / d Hunger = 1/2 x Hunger^(-1/2) = 1 / ( 2 x Hunger^(1/2) )

9 Now we just plug the derivatives into The Chain Rule…

d Craves / d Time = ( d Craves / d Hunger ) x ( d Hunger / d Time )

= 1 / ( 2 x Hunger^(1/2) ) x ( 2 x Time )

= 2 x Time / ( 2 x Hunger^(1/2) )

d Craves / d Time = Time / Hunger^(1/2)

10 …and we see that when there's a change in Time Since Last Snack, the change in Craves Ice Cream is equal to Time divided by the square root of Hunger. BAM!!!

NOTE: In this example, it was obvious that Hunger was the link between Time Since Last Snack and Craves Ice Cream, which made it easy to apply The Chain Rule. However, usually we get both equations jammed together like this…

Craves Ice Cream = ( Time² + 0.5 )^(1/2)

…and it's not so obvious how The Chain Rule applies. So, we'll talk about how to deal with this next!!!
299
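If you want to verify the result on this page, here's a hedged sympy check; the symbol names are mine, and the equations are from this example.

```python
import sympy

# Hunger = Time^2 + 0.5 and Craves Ice Cream = Hunger^(1/2).
time = sympy.Symbol("time", positive=True)
hunger = time**2 + sympy.Rational(1, 2)

d_craves_d_hunger = 1 / (2 * sympy.sqrt(hunger))  # d Craves / d Hunger
d_hunger_d_time = sympy.diff(hunger, time)        # d Hunger / d Time = 2 x Time

# The Chain Rule: d Craves / d Time = (d Craves/d Hunger) x (d Hunger/d Time)
print(sympy.simplify(d_craves_d_hunger * d_hunger_d_time))
# time/sqrt(time**2 + 1/2), i.e., Time / Hunger^(1/2)
```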
Appendix F: The Chain Rule,
When The Link Is Not Obvious
1 In the last part, we said that raising the sum by 1/2 makes it difficult to apply The Power Rule to this equation…

Craves Ice Cream = ( Time² + 0.5 )^(1/2)

…but there was an obvious way to link Time to Craves with Hunger, so we determined the derivative with The Chain Rule. However, even when there's no obvious way to link equations, we can create a link so that we can still apply The Chain Rule.

2 First, let's create a link between Time and Craves Ice Cream called Inside, which is equal to the stuff inside the parentheses…

Inside = Time² + 0.5

…and that means Craves Ice Cream can be rewritten as the square root of the stuff Inside.

Craves Ice Cream = ( Inside )^(1/2)

5 Lastly, we plug them into The Chain Rule…

d Craves / d Time = 1 / ( 2 x Inside^(1/2) ) x ( 2 x Time )

= 2 x Time / ( 2 x Inside^(1/2) )

d Craves / d Time = Time / Inside^(1/2)

…and just like when the link, Hunger, was obvious, when we created a link, Inside, we got the exact same result, since Inside and Hunger are both equal to Time² + 0.5. BAM!!!
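And here's the created-link version in sympy: we treat Inside as its own symbol, apply The Chain Rule, and then substitute the stuff inside the parentheses back in (again, the symbol names are mine).

```python
import sympy

time = sympy.Symbol("time", positive=True)
inside = sympy.Symbol("inside", positive=True)

craves = sympy.sqrt(inside)                     # Craves Ice Cream = (Inside)^(1/2)
d_craves_d_inside = sympy.diff(craves, inside)  # 1 / (2 x Inside^(1/2))
d_inside_d_time = sympy.diff(time**2 + sympy.Rational(1, 2), time)  # 2 x Time

# The Chain Rule, then swap the definition of Inside back in.
d_craves_d_time = (d_craves_d_inside * d_inside_d_time).subs(
    inside, time**2 + sympy.Rational(1, 2)
)
print(sympy.simplify(d_craves_d_time))  # time/sqrt(time**2 + 1/2) -- same result
```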
The idea for this book came from comments on my YouTube channel. I'll admit, when I first saw that people wanted a StatQuest book, I didn't think it would be possible
because I didn’t know how to explain things in writing the way I explained things with
pictures. But once I created the StatQuest study guides, I realized that instead of
writing a book, I could draw a book. And knowing that I could draw a book, I started to
work on this one.
This book would not have been possible without the help of many, many people.
First, I’d like to thank all of the Triple BAM supporters on Patreon and YouTube: U-A
Castle, J. Le, A. Izaki, Gabriel Robet, A. Doss, J. Gaynes, Adila, A. Takeh, J. Butt, M.
Scola, Q95, Aluminum, S. Pancham, A. Cabrera, and N. Thomson.
I’d also like to thank my copy editor, Wendy Spitzer. She worked magic on this book
by correcting a million errors, giving me invaluable feedback on readability, and making
sure each concept was clearly explained. Also, I’d like to thank Adila, Gabriel Robet,
Mahmud Hasan, PhD, Ruizhe Ma, Samuel Judge and Sean Moran who helped me
with everything from the layout to making sure the math was done properly.
Lastly, I’d like to thank Will Falcon and the whole team at Lightning AI.
302
Index
304