The StatQuest Illustrated Guide to Machine Learning!!!
TRIPLE BAM!!!
By Josh Starmer, Ph.D.
The StatQuest Illustrated Guide to Machine Learning!!!
Copyright © 2022 Joshua Starmer
www.statquest.org
Thank you for buying The StatQuest Illustrated Guide to Machine Learning!!!

Every penny of your purchase goes to supporting StatQuest and helps create new videos, books, and webinars. Thanks to you, StatQuest videos and webinars are free on YouTube for everyone in the whole wide world to enjoy.

NOTE: If you're borrowing this copy of The StatQuest Illustrated Guide to Machine Learning from a friend, feel free to enjoy it. However, if you find it helpful, consider buying your own copy or supporting StatQuest however you can at https://statquest.org/support-statquest/.
For F and B, for teaching me to think differently.
Since every StatQuest starts with a Silly Song…

(C Major) (G Major) (C Major)
The Stat-Quest Il-lus-tra-ted Guide is here!!! Hooray!!!

Scan, click, or tap this QR code to hear the Silly Song!!!

StatQuest!!!
I'm Josh Starmer, and welcome to The StatQuest Illustrated Guide to Machine Learning!!! In this book, we'll talk about machine learning, from the fundamental concepts all the way to neural networks, one step at a time.
Table of Contents

01 Fundamental Concepts in Machine Learning!!!
02 Cross Validation!!!
03 Fundamental Concepts in Statistics!!!
04 Linear Regression!!!
05 Gradient Descent!!!
06 Logistic Regression!!!
07 Naive Bayes!!!
10 Decision Trees!!!
12 Neural Networks!!!
Appendices!!!
How This Book Works

1 NOTE: Before we get started, let's talk a little bit about how this book works by looking at a sample page.

2 Each page starts with a header that tells you exactly what concept we're focusing on.
Chapter 01
Fundamental Concepts in Machine Learning!!!
Machine Learning: Main Ideas

1 Hey Normalsaurus, can you summarize all of machine learning in a single sentence?

Sure thing, StatSquatch! Machine Learning (ML) is a collection of tools and techniques that transforms data into (hopefully good) decisions by making classifications, like whether or not someone will love a movie, or quantitative predictions, like how tall someone is.
2 A Solution: We can use our data to build a Classification Tree (for details, see Chapter 10) to classify a person as someone who will like StatQuest or not.

a Once the Classification Tree is built, we can use it to make Classifications by starting at the top and asking the question, "Are you interested in Machine Learning?"

b If you're not interested in machine learning, go to the right…

c …and now we ask, "Do you like Silly Songs?" If yes, then you will like StatQuest!!! If no… :(

f And if you are interested in machine learning, then the Classification Tree predicts that you will like StatQuest!!!

g BAM!!! Now let's learn the main ideas of how machine learning is used for Regression.
2 A Solution: Using a method called Linear Regression (for details, see Chapter 4), we can fit a line to the original data we collected and use that line to make quantitative predictions. The line, which goes up as the value for Weight increases, summarizes the trend we saw in the data: as a person's Weight increases, generally speaking, so does their Height. So, given a person's Weight… …we could use the line to predict that this is their Height. BAM!!!

Because there are lots of machine learning methods to choose from, let's talk about how to pick the best one for our problem.
Comparing Machine Learning Methods: Main Ideas

1 The Problem: As we'll learn in this book, machine learning consists of a lot of different methods that allow us to make Classifications or Quantitative Predictions. How do we choose which one to use?

2 A Solution: In machine learning, deciding which method to use often means just trying it and seeing how well it performs. For example, given this person's Weight… …the black line predicts this Height… …or we could use this green squiggle to predict Height from Weight. In contrast, the green squiggle predicts that the person will be slightly taller. We can compare those two predictions to the person's actual Height to determine the quality of each prediction. BAM!!!

Now that we understand the Main Ideas of how to compare machine learning methods, let's get a better sense of how we do this in practice.
Comparing Machine Learning Methods: Intuition Part 1

1 The original data that we use to observe the trend and fit the line is called Training Data. In other words, the black line is fit to the Training Data.

2 Alternatively, we could have fit a green squiggle to the Training Data. The green squiggle fits the Training Data better than the black line, but remember, the goal of machine learning is to make predictions, so we need a way to determine if the black line or the green squiggle makes better predictions.
Comparing Machine Learning Methods: Intuition Part 2

4 Now, if these blue dots are the Testing Data… 5 …then we can compare their Observed Heights to the Heights Predicted by the black line and the green squiggle.

6 The first person in the Testing Data had this Weight… …and we can measure the distance, or error, between the Observed and Predicted Heights.

7 However, the black line predicts that they are taller…

8 Likewise, we measure the error between the Observed and Predicted values for the second person in the Testing Data.

9 We can then add the two errors together to get a sense of how close the two Predictions are to the Observed values for the black line: First Error + Second Error = Total Error.
Comparing Machine Learning Methods: Intuition Part 3

10 Likewise, we can measure the distances, or errors, between the Observed Heights and the Heights Predicted by the green squiggle.

11 We can then add the two errors together to get a sense of how close the Predictions are to the Observed values for the green squiggle: First Error + Second Error = Total Error.

And we see that the sum of the errors for the black line is shorter than for the green squiggle (Total Black Line Errors vs. Total Green Squiggle Errors), suggesting that it did a better job making predictions.
Comparing Machine Learning Methods: Intuition Part 4

BAM!!! The same idea works for fancier methods, like Support Vector Machines and Neural Networks: we can compare the Total Fancy Method Error to the Total Black Line Error.

Now, regardless of whether we look at the data in the graph or in the table below, we can see that Weight varies from person to person, and thus, Weight is called a Variable. Likewise, Height varies from person to person, so Height is also called a Variable.

Weight  Height
0.4     1.1
1.2     1.9
1.9     1.7
2.0     2.8
2.8     2.3
Terminology Alert!!! Discrete and Continuous Data

1 Discrete Data… …is countable and only takes specific values.

4 Rankings and other orderings (1st, 2nd, 3rd) are also Discrete. There is no award for coming in 1.68 place. Total bummer!

5 Continuous Data… …is measurable and can take any numeric value within a range, like 152.11 cm. So the precision of Continuous measurements is only limited by the tools we use.
Now we know about different types of data and how they can be used for Training and Testing. What's next? In the next chapter, we'll learn how to select which data points should be used for Training and which should be used for Testing using a technique called Cross Validation.
Chapter 02
Cross Validation!!!
Cross Validation: Main Ideas

1 The Problem: So far, we've simply been told which points are the Training Data… …and which points are the Testing Data. However, usually no one tells us what is for Training and what is for Testing. How do we pick the best points for Training and the best points for Testing?

2 A Solution: When we're not told which data should be used for Training and for Testing, we can use Cross Validation to figure out which is which in an unbiased way. Rather than worry too much about which specific points are best for Training and best for Testing, Cross Validation uses all points for both in an iterative way, meaning that we use them in steps. BAM!!!

Testing set #1   Testing set #2   Testing set #3
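As a minimal sketch of this idea, assuming scikit-learn is available (the Weight/Height numbers here are made-up examples, not the book's data), we can let each third of the data take a turn as the Testing set:

```python
# A minimal sketch of 3-fold Cross Validation with scikit-learn;
# the weight/height values are hypothetical examples.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

weight = np.array([[0.5], [1.2], [1.9], [2.0], [2.3], [2.9]])
height = np.array([1.1, 1.9, 1.7, 2.8, 1.9, 3.2])

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
for i, (train_idx, test_idx) in enumerate(kfold.split(weight), start=1):
    # fit the line to the Training blocks, then measure errors on the Testing block
    line = LinearRegression().fit(weight[train_idx], height[train_idx])
    errors = height[test_idx] - line.predict(weight[test_idx])
    print(f"Testing set #{i}: total squared error = {np.sum(errors**2):.2f}")
```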
Cross Validation: Details Part 1

1 Imagine we gathered these 6 pairs of Weight and Height measurements… 2 …so we decide to fit a line to the data with Linear Regression (for details, see Chapter 4).

TERMINOLOGY ALERT!!! Reusing the same data for Training and Testing is called Data Leakage, and it usually results in you believing the machine learning method will perform better than it does because it is Overfit.
6 Now, in the first Cross Validation iteration, we'll use Groups 1 and 2 for Training… …and Group 3 for Testing.

7 Then, just like before, we can measure the errors for each point in the Testing Data (Iteration #1: Black Line Errors)… …however, unlike before, we don't stop here. Instead, we keep iterating so that Groups 1 and 2 can each get a turn to be used for Testing.
A Gentle Reminder…

9 …3 iterations of Training and Testing:

Iteration #1: Train on Groups 2 and 3, Test on Group 1
Iteration #2: Train on Groups 1 and 3, Test on Group 2
Iteration #3: Train on Groups 1 and 2, Test on Group 3

NOTE: Because each iteration uses a different combination of data for Training, each iteration results in a slightly different fitted line (and a slightly different fitted squiggle). A different fitted line combined with using different data for Testing results in each iteration giving us different prediction errors. We can average these errors to get a general sense of how well this model will perform with future data… BAM!!!

NOTE: In this example, the black line consistently performed better than the green squiggle, but this isn't usually the case. We'll talk more about this later.
Cross Validation: Details Part 5

Here, the data are divided into 10 numbered blocks. Then, we Train using the first 9 blocks… …and Test using the tenth block.
Cross Validation: Details Part 6

15 Another commonly used form of Cross Validation is called Leave-One-Out. Leave-One-Out Cross Validation uses all but one point for Training… …and uses the one remaining point for Testing… …and then iterates until every single point has been used for Testing.
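A minimal sketch of Leave-One-Out, again assuming scikit-learn and made-up data:

```python
# A minimal sketch of Leave-One-Out Cross Validation: every point gets
# exactly one turn in the Testing set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

weight = np.array([[0.5], [1.2], [1.9], [2.0], [2.3], [2.9]])
height = np.array([1.1, 1.9, 1.7, 2.8, 1.9, 3.2])

squared_errors = []
for train_idx, test_idx in LeaveOneOut().split(weight):
    line = LinearRegression().fit(weight[train_idx], height[train_idx])
    error = height[test_idx] - line.predict(weight[test_idx])
    squared_errors.append(float(error[0] ** 2))
print(f"average squared error: {np.mean(squared_errors):.2f}")
```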
Cross Validation: Details Part 7

16 When we use Cross Validation to compare machine learning methods (for example, if we wanted to compare a black line to a green squiggle)… …sometimes the black line will perform better than the green squiggle… …and sometimes the black line will perform worse than the green squiggle. And, after doing all of the iterations (Iteration #3, Iteration #4, Iteration #5, Iteration #6, …), we're left with a variety of results, some showing that the black line is better… TRIPLE BAM!!!
Chapter 03
Fundamental Concepts in Statistics!!!
Statistics: Main Ideas

1 The Problem: The world is an interesting place, and things are not always the same. For example, every time we order french fries, we don't always get the exact same number of fries. YUM!!! vs. YUM! (Fry Diary: Monday, 21 fries… Thursday, ???)

2 A Solution: Statistics can help us predict how many fries we'll get the next time we order them, and it tells us how confident we should be in that prediction. For example, if we predict that the medicine will help, but we're not very confident in that prediction, we might not recommend the medicine and use a different therapy to help the patient.

3 The first step in making predictions is to identify trends in the data that we've collected, so let's talk about how to do that with a Histogram.
Histograms: Main Ideas

1 The Problem: We have a lot of measurements and want to gain insights into their hidden trends. A Solution: A Histogram! BAM!!!
Histograms: Details

1 The taller the stack within a bin, the more measurements we made that fall into that bin.

2 We can use the histogram to estimate the probability of getting future measurements. Because most of the measurements are inside this red box, we might be willing to bet that the next measurement we make will be somewhere in this range. Extremely short or tall measurements are rarer and less likely to happen in the future.

3 NOTE: Figuring out how wide to make the bins can be tricky. If the bins are too wide, then they're not much help… …and if the bins are too narrow, then they're not much help either. BAM!

In Chapter 7, we'll use histograms to make classifications using a machine learning algorithm called Naive Bayes. GET EXCITED!!!
Histograms: Calculating Probabilities Step-by-Step

3 To estimate the probability that the next measurement will be in a red box that spans all of the data… …we count the number of measurements in the box, 19…
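As a minimal sketch of this calculation (with made-up heights standing in for the 19 measurements), the estimated probability is just the count inside the range divided by the total count:

```python
# A minimal sketch of estimating a probability from measurements:
# the fraction of measurements that fall inside a range ("red box").
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=155.7, scale=10.2, size=19)  # hypothetical data

low, high = 145.0, 165.0                 # a "red box" over part of the data
inside = np.sum((heights >= low) & (heights <= high))
print(f"estimated probability: {inside} / {len(heights)} = {inside / len(heights):.2f}")
```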
Probability Distributions: Main Ideas

1 The Problem: If we don't have much data, then we can't make very precise probability estimates with a histogram… …however, collecting tons of data to make precise estimates can be time-consuming and expensive. Is there another way? YES!!!

2 A Solution: When we have discrete data, instead of collecting a ton of data to make a histogram and then worrying about blank spaces when calculating probabilities, we can let mathematical equations do all of the hard work for us.

3 One of the most commonly used Discrete Probability Distributions is the Binomial Distribution:

p(x | n, p) = [n! / (x!(n − x)!)] × p^x × (1 − p)^(n − x)

As you can see, it's a mathematical equation, so it doesn't depend on collecting tons of data, but, at least to StatSquatch, it looks super scary!!! "The Binomial Distribution makes me want to run away and hide." "Don't be scared, 'Squatch. If you keep reading, you'll find that it's not as bad as it looks." Once you see what's going on inside, the Binomial Distribution is really simple. However, before we go through it one step at a time, let's try to understand the main ideas of what makes the equation so useful.
The Binomial Distribution: Main Ideas Part 1

1 First, let's imagine we're walking down the street in StatLand and we ask the first 3 people we meet if they prefer pumpkin pie or blueberry pie… …and the first 2 people say they prefer pumpkin pie… …and the last person says they prefer blueberry pie.
The Binomial Distribution: Main Ideas Part 2

4 It could have just as easily been the case that the first person said they prefer blueberry and the last two said they prefer pumpkin. In this case, we would multiply the numbers together in a different order, but the probability would still be 0.147 (see Appendix A for details).

5 Likewise, if only the second person said they prefer blueberry, we would multiply the numbers together in a different order and still get 0.147.

6 So, we see that all three combinations are equally probable:

0.3 x 0.7 x 0.7 = 0.147
0.7 x 0.3 x 0.7 = 0.147
0.7 x 0.7 x 0.3 = 0.147

7 …and that means that the probability of observing that 2 out of 3 people prefer pumpkin pie is the sum of the 3 possible arrangements of people's pie preferences, 0.441. NOTE: Calculating probabilities by hand like this quickly gets tedious…
The Binomial Distribution: Main Ideas Part 3

8 However, things quickly get tedious when we start asking more people which pie they prefer. For example, if we wanted to calculate the probability of observing that 2 out of 4 people prefer pumpkin pie, we would have to calculate and sum the individual probabilities from 6 different arrangements…

9 So, instead of drawing out different arrangements of pie slices, we can use the equation for the Binomial Distribution to calculate the probabilities directly:

p(x | n, p) = [n! / (x!(n − x)!)] × p^x × (1 − p)^(n − x)

BAM!!!

In the next pages, we'll use the Binomial Distribution to calculate the probabilities of pie preference among 3 people, but it works in any situation that has binary outcomes, like wins and losses, yeses and noes, or successes and failures.
The Binomial Distribution: Details Part 1

Gentle Reminder: We're using the equation for the Binomial Distribution to calculate the probability that 2 out of 3 people prefer pumpkin pie:

0.3 x 0.7 x 0.7 = 0.147
+ 0.7 x 0.3 x 0.7 = 0.147
+ 0.7 x 0.7 x 0.3 = 0.147

1 First, let's focus on just the left-hand side of the equation, p(x | n, p). In our pie example, x is the number of people who prefer pumpkin pie, so in this case, x = 2… …n is the number of people we ask. In this case, n = 3… …and p is the probability that someone prefers pumpkin pie. In this case, p = 0.7.

p(x | n, p) = [n! / (x!(n − x)!)] × p^x × (1 − p)^(n − x)
The Binomial Distribution: Details Part 2

3 Now, let's look at the first part on the right-hand side of the equation. StatSquatch says it looks scary because it has factorials (the exclamation points; see below for details), but it's not that bad. Despite the factorials, the first term simply represents the number of different ways we can meet 3 people, 2 of whom prefer pumpkin pie… …and, as we saw earlier, there are 3 different ways that 2 out of 3 people we meet can prefer pumpkin pie.

4 When we plug in x = 2, the number of people who prefer pumpkin pie… …and n = 3, the number of people we asked, and then do the math… …we get 3, the same number we got when we did everything by hand:

n! / (x!(n − x)!) = 3! / (2!(3 − 2)!) = 3! / (2! x 1!) = (3 x 2 x 1) / (2 x 1 x 1) = 3

"Hey Norm, what's a factorial?" A factorial, indicated by an exclamation point, is just the product of the integer number and all positive integers below it. For example, 3! = 3 x 2 x 1 = 6.

NOTE: If x is the number of people who prefer pumpkin pie, and n is the total number of people, then (n − x) is the number of people who prefer blueberry pie.
The Binomial Distribution: Details Part 3

5 Now let's look at the second term on the right-hand side. The second term is just the probability that 2 out of the 3 people prefer pumpkin pie. In other words, since p, the probability that someone prefers pumpkin pie, is 0.7… …and there are x = 2 people who prefer pumpkin pie, the second term = 0.7^2 = 0.7 x 0.7.

6 The third and final term is the probability that 1 out of 3 people prefers blueberry pie… …because if p is the probability that someone prefers pumpkin pie, then (1 − p) is the probability that someone prefers blueberry pie… …and if x is the number of people who prefer pumpkin pie and n is the total number of people we asked, then n − x is the number of people who prefer blueberry pie.

Just so you know, sometimes people let q = (1 − p) and use q in the formula instead of (1 − p). So, in this example, if we plug in p = 0.7, n = 3, and x = 2, the third term is (1 − 0.7)^(3 − 2) = 0.3^1 = 0.3.
The Binomial Distribution: Details Part 4

7 Now that we've looked at each part of the equation for the Binomial Distribution, let's put everything together and solve for the probability that 2 out of 3 people we meet prefer pumpkin pie. We start by plugging in the number of people who prefer pumpkin pie, x = 2, the number of people we asked, n = 3, and the probability that someone prefers pumpkin pie, p = 0.7…

p(x = 2 | n = 3, p = 0.7) = [3! / (2!(3 − 2)!)] × 0.7^2 × (1 − 0.7)^(3 − 2)

…then we just do the math…

p(x = 2 | n = 3, p = 0.7) = 3 × 0.7^2 × 0.3^1
p(x = 2 | n = 3, p = 0.7) = 3 × 0.7 × 0.7 × 0.3
p(x = 2 | n = 3, p = 0.7) = 0.441

…and the result, 0.441, is the same value we got when we drew pictures of the slices of pie. TRIPLE BAM!!!

(Psst! Remember: the first term is the number of ways we can arrange the pie preferences, the second term is the probability that 2 people prefer pumpkin pie, and the last term is the probability that 1 person prefers blueberry pie.)
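As a quick check of this result, the same calculation can be done by hand or with scipy's binomial distribution:

```python
# A minimal check of the pie example; scipy.stats.binom.pmf(x, n, p)
# implements the same Binomial Distribution equation as the text.
from math import comb
from scipy.stats import binom

n, x, p = 3, 2, 0.7
by_hand = comb(n, x) * p**x * (1 - p)**(n - x)  # 3 x 0.7 x 0.7 x 0.3
print(by_hand)                                  # 0.441
print(binom.pmf(x, n, p))                       # 0.441, the same value
```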
"Hey Norm, I sort of understand the Binomial Distribution. Is there another commonly used Discrete Distribution that you think I should know about?"
The Poisson Distribution: Details

1 So far, we've seen how the Binomial Distribution gives us probabilities for sequences of binary outcomes, like 2 out of 3 people preferring pumpkin pie, but there are lots of other Discrete Probability Distributions for lots of different situations. One of them is the Poisson Distribution:

p(x | λ) = (e^(−λ) × λ^x) / x!

4 There are lots of other Discrete Probability Distributions for lots of other types of data. In general, their equations look intimidating, but looks are deceiving. Once you know what each symbol means, you just plug in the numbers and do the math. BAM!!!

To summarize, we've seen that Discrete Probability Distributions can be derived from histograms… …and while these can be useful, they require a lot of data that can be expensive and time-consuming to get, and it's not always clear what to do about the blank spaces. Instead, we can use mathematical equations like the Binomial Distribution:

p(x | n, p) = [n! / (x!(n − x)!)] × p^x × (1 − p)^(n − x)

Now let's talk about Continuous Probability Distributions.
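As a minimal sketch of plugging numbers into the Poisson equation (the rate λ = 2 and the count x = 3 are made-up examples):

```python
# A minimal sketch of the Poisson equation p(x | λ) = e^(−λ) λ^x / x!
from math import exp, factorial
from scipy.stats import poisson

lam, x = 2.0, 3
by_hand = exp(-lam) * lam**x / factorial(x)
print(by_hand)               # about 0.180
print(poisson.pmf(x, lam))   # the same value from scipy
```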
Continuous Probability Distributions: Main Ideas

1 The Problem: Although they can be super useful, beyond needing a lot of data, histograms have two problems when it comes to continuous data: 1) it's not always clear what to do about gaps in the data and… …2) histograms can be very sensitive to the size of the bins. If the bins are too wide, then we lose all of the precision… …and if the bins are too narrow, it's impossible to see trends.
The Normal (Gaussian) Distribution: Main Ideas Part 1

1 Chances are you've seen a Normal or Gaussian distribution before. It's also called a Bell-Shaped Curve because it's a symmetrical curve…that looks like a bell. A Normal distribution is symmetrical about the mean, or average, value.

2 The y-axis represents the Likelihood of observing any specific Height: for example, it's relatively rare to see someone who is super short… …relatively common to see someone who is close to the average height… …and relatively rare to see someone who is super tall.

3 Here are two Normal Distributions of the heights of male infants (= Infant) and adults (= Adult). Each curve depends on: 1) the Mean of the measurements, which tells you where the center of the curve goes, and 2) the Standard Deviation of the measurements (for example, the standard deviation for adults is 10.2), which tells you how tall and skinny, or short and fat, the curve should be.

1 The equation for the Normal Distribution looks scary, but, just like every other equation, it's just a matter of plugging in numbers and doing the math:

f(x | μ, σ) = (1 / √(2πσ²)) × e^(−(x − μ)² / (2σ²))

2 To see how the Normal Distribution works, let's calculate the likelihood (the y-axis coordinate) for an infant that is 50 cm tall. Since the mean of the distribution is also 50 cm, we'll calculate the y-axis coordinate for the highest part of the curve.

3 x is the x-axis coordinate; so, in this example, the x-axis represents Height in cm. The Greek character μ, mu, represents the mean of the distribution; in this case, μ = 50. Lastly, the Greek character σ, sigma, represents the standard deviation of the distribution; for the infants, σ = 1.5. Plugging everything in:

f(x = 50 | μ = 50, σ = 1.5) = (1 / √(2π × 1.5²)) × e^(−(50 − 50)² / (2 × 1.5²))
= (1 / √14.1) × e^(−0 / 4.5) ≈ 0.27

Remember, the output from the equation, the y-axis coordinate, is a likelihood, not a probability. In Chapter 7, we'll see how likelihoods are used in Naive Bayes. To learn how to calculate probabilities with Continuous Distributions, read on…
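As a quick check of this likelihood with a computer, assuming scipy is available:

```python
# A minimal check of the infant-height likelihood, using the values implied
# by the worked example: mean = 50 cm and standard deviation = 1.5.
from scipy.stats import norm

likelihood = norm.pdf(50, loc=50, scale=1.5)  # 1 / sqrt(2π x 1.5²)
print(likelihood)  # about 0.27
```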
Calculating Probabilities with Continuous Probability Distributions: Details

1 For Continuous Probability Distributions, probabilities are the area under the curve between two points. 2 This is true regardless of how tall and skinny… …or short and fat a distribution is. (In the figure, the x-axis is marked at 142.5 cm, 155.7 cm, and 168.9 cm, and the shaded area under the curve is 0.48.)

There are two ways to calculate the area between two points, a and b:

1) The hard way, by using calculus and integrating the equation between the two points:

Area = ∫ from a to b of f(x) dx

UGH!!! NO ONE ACTUALLY DOES THIS!!!

2) The easy way, by using a computer. See Appendix C for a list of commands. BAM!!!

One way to understand why the probability of any single, exact measurement is 0 is to remember that probabilities are areas, and the area of something with no width is 0. Another way is to realize that a continuous distribution has infinite precision; thus, we're really asking the probability of measuring someone who is exactly 155.7000000000000000000000000000000000000… tall.
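As a minimal sketch of "the easy way" (the mean of 155.7 and standard deviation of 10.2 are assumptions pieced together from the numbers on this page):

```python
# A minimal sketch of computing an area under a Normal curve with the CDF;
# the mean and standard deviation below are assumptions, not the book's exact setup.
from scipy.stats import norm

area = (norm.cdf(168.9, loc=155.7, scale=10.2)
        - norm.cdf(142.5, loc=155.7, scale=10.2))
print(area)  # roughly 0.80 for these assumed parameters
```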
"Norm, the Normal Distribution is awesome, and, to be honest, it sort of looks like you, which is cool, but are there other Continuous Distributions we should know about?"

"Sure 'Squatch! We should probably also learn about the Exponential and Uniform Distributions."
Other Continuous Distributions: Main Ideas

1 Exponential Distributions are commonly used when we're interested in how much time passes between events. For example, we could measure how many minutes pass between page turns in this book.

2 Uniform Distributions are commonly used to generate random numbers that are equally likely to occur. For example, if I want to select random numbers between 0 and 1, then I would use a Uniform Distribution that goes from 0 to 1, which is called a Uniform 0,1 Distribution, because it ensures that every value between 0 and 1 is equally likely to occur. In contrast, if I wanted to generate random numbers between 0 and 5, then I would use a Uniform Distribution that goes from 0 to 5, which is called a Uniform 0,5 Distribution. Uniform Distributions can span any 2 numbers, so we could have a Uniform 1,3.5 Distribution if we wanted one.

NOTE: Because there are fewer values between 0 and 1 than between 0 and 5, the corresponding likelihood for any specific number is higher for the Uniform 0,1 Distribution than for the Uniform 0,5 Distribution.

Using Distributions To Generate Random Numbers: We can get a computer to generate numbers that reflect the likelihoods of any distribution. In machine learning, we usually need to generate random numbers to initialize algorithms before training them with Training Data. Random numbers are also useful for randomizing the order of our data, which is useful for the same reasons we shuffle a deck of cards before playing a game. We want to make sure everything is randomized.
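As a minimal sketch of generating these kinds of random numbers with NumPy (the specific ranges and the Exponential's mean are arbitrary examples):

```python
# A minimal sketch of drawing random numbers from these distributions.
import numpy as np

rng = np.random.default_rng(42)
print(rng.uniform(0, 1, size=3))           # Uniform 0,1
print(rng.uniform(0, 5, size=3))           # Uniform 0,5
print(rng.exponential(scale=2.0, size=3))  # Exponential, mean of 2 between events
print(rng.permutation([1, 2, 3, 4, 5]))    # shuffling, like a deck of cards
```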
Continuous Probability Distributions: Summary

1 Just like Discrete Distributions, Continuous Distributions spare us from having to gather tons of data for a histogram…

2 …and, additionally, Continuous Distributions also spare us from having to decide how to bin the data.

3 Instead, Continuous Distributions use equations that represent smooth curves and can provide likelihoods and probabilities for all possible measurements:

f(x | μ, σ) = (1 / √(2πσ²)) × e^(−(x − μ)² / (2σ²))

4 Like Discrete Distributions, there are Continuous Distributions for all kinds of data, like the values we get from measuring people's height or timing how long it takes you to read this page.

In the context of machine learning, both types of distributions allow us to create Models that can predict what will happen next.
Models: Main Ideas Part 2

Models, or equations, can tell us about values we haven't measured. For example, given the model Height = 0.5 + (0.8 x Weight), if someone's Weight is 2.1… …we plug the Weight into the equation and solve for Height…

Height = 0.5 + (0.8 x Weight)
Height = 0.5 + (0.8 x 2.1)
Height = 2.18

…and get 2.18.

5 Because models are only approximations, it's important that we're able to measure the quality of their predictions.

2 A Solution: One way to quantify the quality of a model and its predictions is to calculate the Sum of the Squared Residuals. As the name implies, we start by calculating Residuals, the differences between the Observed values and the values Predicted by the model:

Residual = Observed − Predicted

Visually, we can draw the Residuals with green lines. Since, in general, the smaller the Residuals, the better the model fits the data, it's tempting to compare models by comparing the sum of their Residuals, but the Residuals below the blue line would cancel out the ones above it!!! So, instead, we square each Residual and add the squares up:

SSR = Σ (Observedᵢ − Predictedᵢ)²   (summing from i = 1 to n)

The index i refers to each Observation; for example, i = 1 refers to the first Observation. Squaring the Residuals keeps them from canceling out, and it also makes it easy to take the derivative, which will come in handy when we do Gradient Descent in Chapter 5.
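As a minimal sketch of calculating the SSR for the model Height = 0.5 + (0.8 x Weight), with made-up Observed values:

```python
# A minimal sketch of the Sum of the Squared Residuals; the observed data
# below are hypothetical examples.
import numpy as np

weight = np.array([1.0, 2.1, 2.9])
observed = np.array([1.4, 2.2, 2.7])
predicted = 0.5 + 0.8 * weight

residuals = observed - predicted
ssr = np.sum(residuals ** 2)
print(ssr)
```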
The Sum of the Squared Residuals: Main Ideas Part 2

3 So far, we've looked at the SSR only in terms of a simple straight line model, but we can calculate it for all kinds of models. Here's an example of the Residuals for altitude vs. time for SpaceX's SN9 rocket (Oops!)… …and here's an example of the Residuals for a sinusoidal model of rainfall. Some months are more rainy than others, and the pattern is cyclical over time. If you can calculate the Residuals, you can square them and add them up!

1 In this example, we have 3 Observations, so n = 3, and we expand the summation into 3 terms. For i = 1, the term for the first Observation is (Observed₁ − Predicted₁)² = (1.9 − 1.7)².

The Sum of Squared Residuals (SSR) = Σ (Observedᵢ − Predictedᵢ)²   (summing from i = 1 to n)

Dividing by the number of Observations, n, gives us the Mean Squared Error:

Mean Squared Error (MSE) = SSR / n
Mean Squared Error (MSE): Step-by-Step

1 Now let's see the MSE in action by calculating it for two datasets!!!

Mean Squared Error (MSE) = SSR / n = [Σ (Observedᵢ − Predictedᵢ)²] / n

2 The first dataset has only 3 points and the SSR = 14, so the Mean Squared Error (MSE) is 14/3 = 4.7. The second dataset has 5 points and the SSR increases to 22. In contrast, the MSE, 22/5 = 4.4, is now slightly lower. So, unlike the SSR, which increases when we add more data to the model, the MSE can increase or decrease depending on the average residual, which gives us a better sense of how the model is performing overall.

3 Unfortunately, MSEs are still difficult to interpret on their own because their values depend on the scale of the data. For example, if the y-axis is in millimeters and the Residuals are 1, -3, and 2, then the MSE = 4.7. However, if we change the y-axis to meters, then the Residuals for the exact same data shrink to 0.001, -0.003, and 0.002, and the MSE is now 0.0000047. It's tiny!

The good news, however, is that we can solve this problem with R², which compares the SSR (or MSE) around the fitted line to the SSR (or MSE) around the mean:

R² = (SSR(mean) − SSR(fitted line)) / SSR(mean)

In this example, R² = (1.6 − 0.5) / 1.6 = 0.7 …and the result, 0.7, tells us that there was a 70% reduction in the size of the Residuals between the mean and the fitted line.

When SSR(mean) = SSR(fitted line), then both models' predictions are equally good (or equally bad), and R² = (SSR(mean) − SSR(mean)) / SSR(mean) = 0. When SSR(fitted line) = 0, meaning that the fitted line fits the data perfectly, then R² = SSR(mean) / SSR(mean) = 1.
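As a quick check of both ideas, here is the millimeters-vs-meters example and the R² calculation from this page:

```python
# A minimal check: the same residuals in different units give wildly
# different MSEs, while R² (from SSR(mean) = 1.6 and SSR(fitted line) = 0.5)
# is unitless.
import numpy as np

residuals_mm = np.array([1.0, -3.0, 2.0])
print(np.mean(residuals_mm ** 2))           # 4.67 (MSE in millimeters)
print(np.mean((residuals_mm / 1000) ** 2))  # 0.00000467 (MSE in meters)

ssr_mean, ssr_fit = 1.6, 0.5
print((ssr_mean - ssr_fit) / ssr_mean)      # R² = 0.69, about a 70% reduction
```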
R²: Details Part 2

R² = (SSR(mean) − SSR(fitted line)) / SSR(mean)

Because a small amount of random data can have a high (close to 1) R², any time we see a trend in a small dataset, it's difficult to have confidence that a high R² value is not due to random chance.

7 Never satisfied with intuition, statisticians developed something called p-values to quantify how much confidence we should have in R² values and pretty much any other method that summarizes data. We'll talk about p-values in a bit, but first let's calculate R² using the Mean Squared Error (MSE).
Calculating R² with the Mean Squared Error (MSE): Details

So far, we've calculated R² using the Sum of the Squared Residuals (SSR), but we can just as easily calculate it using the Mean Squared Error (MSE).

Gentle Reminder: Residual = Observed − Predicted

Because MSE = SSR / n, when we substitute the MSEs into the formula for R², the sample sizes cancel out:

R² = (MSE(mean) − MSE(fitted line)) / MSE(mean) = [(SSR(mean) − SSR(fitted line)) / SSR(mean)] × (n / n)

…so we end up with R² times 1, which is just R². So, we can calculate R² with the SSR or MSE, whichever is readily available. Either way, we'll get the same value. BAM!!!
R²: FAQ

Does R² always compare the mean to a straight fitted line?

The most common way to calculate R² is to compare the mean to a fitted line. However, we can calculate R² for anything we can calculate the Sum of the Squared Residuals around. For example, for the rainfall data, we can calculate R² based on the Sum of the Squared Residuals around a square wave and a sine wave:

R² = (SSR(square) − SSR(sine)) / SSR(square)

Can R² be negative?

When we're only comparing the mean to a fitted line, R² is positive, but when we compare other types of models, anything can happen. For example, if we compare a line to a parabola:

R² = (SSR(line) − SSR(parabola)) / SSR(line) = (5 − 12) / 5 = -1.4

…we get a negative R² value, -1.4, and it tells us the Residuals increased by 140%.

Is R² related to Pearson's correlation coefficient?

Yes! If you can calculate Pearson's correlation coefficient, ρ (the Greek character rho) or r, for a relationship between two things, then the square of that coefficient is equal to R². In other words…

ρ² = r² = R²

…and now we can see where R² got its name. BAM!!! Now let's talk about p-values!!!
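As a minimal sketch verifying that the squared Pearson correlation matches R² for a straight-line fit (the data are made-up):

```python
# A minimal check that r² equals R² for a least-squares line.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

r = np.corrcoef(x, y)[0, 1]

# R² from the SSRs: around the mean vs. around the fitted line
slope, intercept = np.polyfit(x, y, 1)
ssr_mean = np.sum((y - y.mean()) ** 2)
ssr_fit = np.sum((y - (intercept + slope * x)) ** 2)
r_squared = (ssr_mean - ssr_fit) / ssr_mean

print(r ** 2, r_squared)  # the same value, which is where R² gets its name
```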
p-values: Main Ideas Part 1

1 The Problem: We need to quantify how confident we should be in the results of our analysis. So we gave Drug A to 1 person and they were cured… …and we gave Drug B to another person and they were not cured. Can we conclude that the drugs are different? No!!! Drug B may have failed for a lot of reasons. Maybe this person is taking a medication that has a bad interaction with Drug B, or maybe they have a rare allergy to Drug B, or maybe they didn't take Drug B properly and missed a dose. Or maybe Drug A doesn't actually work, and the placebo effect deserves all of the credit.

NOTE: For now, we'll only focus on determining whether or not Drug A is different from Drug B. If a p-value allows us to establish a difference, then we can worry about whether Drug A is better or worse than Drug B.

In contrast, imagine these were the results:

Drug A: 1,043 Cured, 3 Not Cured
Drug B: 2 Cured, 1,432 Not Cured

It's possible that some of the people taking Drug A were actually cured by placebo, and some of the people taking Drug B were not cured because they had a rare allergy, but there are just too many people cured by Drug A, and too few cured by Drug B, for us to seriously think that these results are just random and that Drug A is no different from Drug B.
p-values: Main Ideas Part 2

4 In contrast, let's say that these were the results:

Drug A: 73 Cured, 125 Not Cured
Drug B: 71 Cured, 127 Not Cured

Drug A cured a larger percentage of people, but given that no study is perfect and there are always a few random things that happen, how confident can we be that Drug A is different from Drug B? When we calculate the p-value for these data using a Statistical Test (for example, Fisher's Exact Test, but we'll save those details for another day), we get 0.9… …which is larger than 0.05. Thus, we would say that we fail to see a difference between these two groups.

So, the question is, "how small does a p-value have to be before we're sufficiently confident that Drug A is different from Drug B?" In other words, what threshold can we use to make a good decision about whether these drugs are different?

5 In practice, a commonly used threshold is 0.05. It means that if there's no difference between Drug A and Drug B, and if we did this exact same experiment a bunch of times, then only 5% of those experiments would result in the wrong decision.
p-values: Main Ideas Part 3

To see what that means, imagine we gave the same drug, Drug A, to both groups and repeated the experiment over and over. Most of the time, the p-values would be large:

Drug A vs. Drug A: 75 Cured, 123 Not Cured vs. 70 Cured, 128 Not Cured; p = 0.7
Drug A vs. Drug A: 69 Cured, 129 Not Cured vs. 71 Cured, 127 Not Cured; p = 0.9
etc., etc., etc.

That makes sense, because both groups are taking Drug A, and the only differences are weird, random things like rare allergies.

8 However, every once in a while, by random chance, all of the people with rare allergies might end up in the group on the left, and we might see something like 30% Cured vs. 42% Cured. Thus, because the p-value is < 0.05 (the threshold we're using for making a decision), we would say that the two groups are different, even though they both took the same drug!

TERMINOLOGY ALERT!!! Getting a small p-value when there is no difference is called a False Positive. A 0.05 threshold for p-values means that 5% of the experiments, where the only differences come from weird, random things, will generate a p-value smaller than 0.05.

9 If it's extremely important that we're correct when we say the drugs are different, then we can use a smaller threshold, like 0.01 or 0.001 or even smaller. Using a threshold of 0.001 would get a False Positive only once in every 1,000 experiments. Likewise, if it's not that important (for example, if we're trying to decide if the ice-cream truck will arrive on time), then we can use a larger threshold, like 0.2. Using a threshold of 0.2 means we're willing to get a False Positive 2 times out of 10. That said, the most common threshold is 0.05 because trying to reduce the number of False Positives below 5% often costs more than it's worth.

For example, given these results:

Drug A: 73 Cured, 125 Not Cured
Drug B: 59 Cured, 131 Not Cured

…if we calculate a p-value for this experiment and the p-value < 0.05, then we'll decide that Drug A is different from Drug B. That said, the p-value = 0.24 (again calculated using Fisher's Exact Test), so we're not confident that Drug A is different from Drug B. BAM!!!

TERMINOLOGY ALERT!!! In fancy statistical lingo, the idea of trying to determine if these drugs are the same or not is called Hypothesis Testing.
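As a quick check of the last example, assuming scipy's implementation of Fisher's Exact Test:

```python
# A minimal check of the drug table with Fisher's Exact Test:
# Drug A (73 cured, 125 not) vs. Drug B (59 cured, 131 not).
from scipy.stats import fisher_exact

odds_ratio, p_value = fisher_exact([[73, 125], [59, 131]])
print(p_value)  # roughly 0.24, matching the text
```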
p-values: Main Ideas Part 5

While a small p-value helps us decide if Drug A is different from Drug B, it doesn't tell us how different they are. In contrast, an experiment involving a lot more people can give us a very small p-value even when the difference between the drugs is tiny. DOUBLE BAM!!!

Now that we understand the main ideas of p-values, let's summarize the main ideas of this chapter.
The Fundamental Concepts of Statistics: Summary

1 We can see trends in data with histograms. We'll learn how to use histograms to make classifications with Naive Bayes in Chapter 7.

2 However, histograms have limitations (they need a lot of data and can have gaps), so we also use probability distributions to represent trends. We'll learn how to use probability distributions to make classifications with Naive Bayes in Chapter 7.

3 Rather than collect all of the data in the whole wide world, which would take forever and be way too expensive, we use models to approximate reality. Histograms and probability distributions are examples of models that we can use to make predictions. We can also use a mathematical formula, like the equation for the blue line, Height = 0.5 + (0.8 x Weight), as a model that makes predictions. Throughout this book, we'll create machine learning models to make predictions.

4 We can evaluate how well a model reflects the data using the Sum of the Squared Residuals (SSR), the Mean Squared Error (MSE), and R². We'll use these metrics throughout the book.

Residual = Observed − Predicted
SSR = Σ (Observedᵢ − Predictedᵢ)²   (summing from i = 1 to n)
Mean Squared Error (MSE) = SSR / n   …where n is the sample size
R² = (SSR(mean) − SSR(fitted line)) / SSR(mean)

5 Lastly, we use p-values to give us a sense of how much confidence we should put in the predictions that our models make. We'll use p-values in Chapter 4 when we do Linear Regression.
TRIPLE BAM!!!

Hooray!!! Now let's learn about Linear Regression.
Chapter 04
Linear Regression!!!
Linear Regression: Main Ideas

1 The Problem: We've collected Weight and Height measurements from 5 people, and we want to use Weight to predict Height, which is continuous… …and in Chapter 3, we learned that we could fit a line to the data and use it to make predictions. However, 1) we didn't talk about how we fit a line to the data… …and 2) we didn't talk about how to quantify how much confidence we should have in its predictions compared to just using the mean y-axis value.

5 Some lines fit the data better than others: this line has larger residuals and a larger SSR… …and this line has even smaller residuals and a smaller SSR. As we can see on the graph, different values for a line's y-axis intercept and slope, shown on the x-axis, change the SSR, shown on the y-axis. Linear Regression selects the line, the y-axis intercept and slope, that results in the minimum SSR. BAM!!!
Fitting a Line to Data: Intuition

1 If we don't change the slope, we can see how the SSR changes for different y-axis intercepts… …and, in this case, the goal of Linear Regression would be to find the y-axis intercept that results in the lowest SSR, at the bottom of the curve.

2 One way to find the lowest point in the curve is to calculate the derivative of the curve (NOTE: If you're not familiar with derivatives, see Appendix D). Setting the derivative equal to 0 and solving gives us an Analytical Solution.

3 Another option is an Iterative Method called Gradient Descent. In contrast to an Analytical Solution, an Iterative Method starts with a guess for the value and then goes into a loop that improves the guess one small step at a time. Although Gradient Descent takes longer than an analytical solution, it's one of the most important tools in machine learning because it can be used in a wide variety of situations where there are no analytical solutions, including Logistic Regression, Neural Networks, and many more.

Because Gradient Descent is so important, we'll spend all of Chapter 5 on it. GET EXCITED!!! "I'm so excited!!!"
p-values for Linear Regression and R²: Main Ideas

1 Now, assuming that we've fit a line to the data that minimizes the SSR using an analytical solution or Gradient Descent, we can calculate R² and a p-value. The R² value, 0.66, suggests that using Weight to predict Height reduces the Residuals by 66% compared to using the mean. In this context, a p-value tells us the probability that random data will give us an R² ≥ 0.66.

3 Because the original dataset has 5 pairs of measurements, one way* to calculate a p-value is to pair 5 random values for Height with 5 random values for Weight and plot them on a graph… …then use Linear Regression to fit a line to the random data and calculate its R² (for example, one set of random data gave us R² = 0.03)… …and then add that R² to a histogram… …and then create >10,000 more sets of random data and add their R² values to the histogram… …and use the histogram to calculate the probability that random data will give us an R² ≥ 0.66.

4 In the end, we get p-value = 0.1, meaning there's a 10% chance that random data could give us an R² ≥ 0.66. That's a relatively high p-value, so we might not have a lot of confidence in the predictions, which makes sense because we didn't have much data to begin with. small bam.

* NOTE: Because Linear Regression was invented before computers could quickly generate random data, this is not the traditional way to calculate p-values, but it works!!!
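As a minimal sketch of this random-data procedure (using the fact, from the R² FAQ, that R² for a straight-line fit equals the squared correlation):

```python
# A minimal sketch of the random-data p-value: how often does random data
# give an R² at least as large as the observed 0.66?
import numpy as np

rng = np.random.default_rng(0)
observed_r2 = 0.66
hits, trials = 0, 10_000
for _ in range(trials):
    random_weight = rng.normal(size=5)
    random_height = rng.normal(size=5)
    r = np.corrcoef(random_weight, random_height)[0, 1]
    if r ** 2 >= observed_r2:   # for a straight-line fit, R² = r²
        hits += 1
print(hits / trials)  # about 0.1, matching the text
```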
Multiple Linear Regression: Main Ideas

1 So far, the example we've used demonstrates something called Simple Linear Regression because we use one variable, Weight, to predict Height… …and, as we've seen, Simple Linear Regression fits a line to the data that we can use to make predictions:

Height = 1.1 + 0.5 x Weight
Beyond Linear Regression

1 As mentioned at the start of this chapter, Linear Regression is the gateway to something called Linear Models, which are incredibly flexible and powerful.

2 Linear Models allow us to use discrete data, like whether or not someone loves the movie Troll 2 (= Loves Troll 2, = Does Not Love Troll 2), to predict something continuous, like how many grams of Popcorn they eat each day.

3 Just like when we used Weight to predict Height, Linear Models will give us an R² for this prediction, which gives us a sense of how accurate the predictions will be, and a p-value that lets us know how much confidence we should have in the predictions. In this case, R² = 0.66 and the p-value = 0.04. The p-value of 0.04 is relatively small, which suggests that it would be unlikely for random data to give us the same result or something more extreme. In other words, we can be confident that knowing whether or not someone loves Troll 2 will improve our prediction of how much Popcorn they will eat.

4 Linear Models also easily combine discrete data, like whether or not someone loves Troll 2, with continuous data, like how much Soda Pop they drink, to predict something continuous, like how much Popcorn they will eat. In this case, R² = 0.97 and the p-value = 0.006: adding how much Soda Pop someone drinks to the model dramatically increased the R² value, which means the predictions will be more accurate, and reduced the p-value, suggesting we can have more confidence in the predictions.

5 If you'd like to learn more about Linear Models, scan, click, or tap this QR code to check out the 'Quests on YouTube!!!

DOUBLE BAM!!!

bam.
Chapter 05
Gradient Descent!!!
Gradient Descent: Main Ideas

1 The Problem: A major part of machine learning is optimizing a model's fit to the data. Sometimes this can be done with an analytical solution, but it's not always possible. For example, there is no analytical solution for Logistic Regression (Chapter 6), which fits an s-shaped squiggle to data. Likewise, there is no analytical solution for Neural Networks (Chapter 12), which fit fancy squiggles to data.

A Solution: When there's no analytical solution, Gradient Descent can optimize a model's fit to the data iteratively, one step at a time. BAM!!!
Gradient Descent: Details Part 1

NOTE: Even though there's an analytical solution for Linear Regression, we're using it to demonstrate how Gradient Descent works because we can compare the output from Gradient Descent to the known optimal values.

1 Let's show how Gradient Descent fits a line to these Height and Weight measurements.

2 Specifically, we'll show how Gradient Descent estimates the intercept and the slope of this line so that we minimize the Sum of the Squared Residuals (SSR).

3 To keep things simple at the start, let's plug in the analytical solution for the slope, 0.64… …and show how Gradient Descent optimizes the intercept one step at a time. Once we understand how Gradient Descent optimizes the intercept, we'll show how it optimizes the intercept and the slope at the same time.
Gradient Descent: Details Part 2

4 In this example, we're fitting a line to data, and we can evaluate how well that line fits with the Sum of the Squared Residuals (SSR). Remember, Residuals are the difference between the Observed and Predicted values, and the Predicted Heights come from the line:

Predicted Height = intercept + 0.64 x Weight

Residual = (Observed Height − Predicted Height) = (Height − (intercept + 0.64 x Weight))

5 Now, because we have 3 data points, and thus, 3 Residuals, the SSR has 3 terms:

SSR = (Observed Height₁ − (intercept + 0.64 x Weight₁))²
    + (Observed Height₂ − (intercept + 0.64 x Weight₂))²
    + (Observed Height₃ − (intercept + 0.64 x Weight₃))²
Gradient Descent: Details Part 3

6 In this first example, since we're only optimizing the y-axis intercept, we'll start by assigning it a random value. In this case, we'll initialize the intercept by setting it to 0.

7 Now, to calculate the SSR, we first plug the value for the y-axis intercept, 0, into the equation we derived in Steps 4 and 5:

SSR = (Observed Height₁ − (0 + 0.64 x Weight₁))²
    + (Observed Height₂ − (0 + 0.64 x Weight₂))²
    + (Observed Height₃ − (0 + 0.64 x Weight₃))²
Gradient Descent: Details Part 4
10 Now, because the goal is to minimize the SSR, it's a type of Loss or Cost Function (see the Terminology Alert below). In Gradient Descent, we minimize the Loss or Cost Function by taking steps away from the initial guess toward the optimal value. In this case, we see that as we increase the intercept, the x-axis of the central graph, we decrease the SSR, the y-axis.

TERMINOLOGY ALERT!!! The terms Loss Function and Cost Function refer to anything we want to optimize when we fit a model to data. For example, we want to optimize the SSR or the Mean Squared Error (MSE) when we fit a straight line with Regression or a squiggly line with Neural Networks. That said, some people use the term Loss Function to specifically refer to a function (like the SSR) applied to only one data point, and use the term Cost Function to specifically refer to a function (like the SSR) applied to all of the data. Unfortunately, these specific meanings are not universal, so be aware of the context and be prepared to be flexible. In this book, we'll use them together and interchangeably, as in "The Loss or Cost Function is the SSR."

11 However, rather than just randomly trying a bunch of values for the y-axis intercept and plotting the resulting SSR on a graph, we can plot the SSR as a function of the y-axis intercept. In other words, the equation for the SSR corresponds to a curve on a graph that has the SSR on the y-axis and the intercept on the x-axis.
14 A relatively large value for the derivative, which corresponds to a relatively steep slope for the tangent line, suggests we're relatively far from the bottom of the curve, so we should take a relatively large step… 15 …and a relatively small value for the derivative suggests we're relatively close to the bottom of the curve, so we should take a relatively small step.

Step 2: Because the Residual links the intercept to the SSR, The Chain Rule tells us that the derivative of the SSR with respect to the intercept is:

d SSR / d intercept = (d SSR / d Residual) x (d Residual / d intercept)

Step 3: Use The Power Rule (Appendix E) to solve for the two derivatives:

d SSR / d Residual = d (Residual)² / d Residual = 2 x Residual

d Residual / d intercept = d (Height − (intercept + 0.64 x Weight)) / d intercept
= d (Height − intercept − 0.64 x Weight) / d intercept = 0 − 1 − 0 = -1

Because of the subtraction, we can remove the parentheses by multiplying everything inside by -1. Then, because the first and last terms do not include the intercept, their derivatives, with respect to the intercept, are both 0. However, the second term is the negative intercept, so its derivative is -1.

Step 4: Plug the derivatives into The Chain Rule to get the final derivative of the SSR with respect to the intercept:

d SSR / d intercept = (d SSR / d Residual) x (d Residual / d intercept) = 2 x Residual x -1
= 2 x (Height − (intercept + 0.64 x Weight)) x -1
= -2 x (Height − (intercept + 0.64 x Weight))

(We multiply the -1 on the right by the 2 on the left to get -2.) BAM!!!
Gradient Descent: Details Part 7

Gentle Reminder: Because we're using Linear Regression as our example, we don't actually need to use Gradient Descent to find the optimal value for the intercept. Instead, we could just set the derivative equal to 0 and solve for the intercept. This would be an analytical solution. However, by applying Gradient Descent to this problem, we can compare the optimal value that it gives us to the analytical solution and evaluate how well Gradient Descent performs. This will give us more confidence in Gradient Descent when we use it in situations without analytical solutions, like Logistic Regression and Neural Networks.

17 So far, we've calculated the derivative of the SSR for a single observation:

SSR = (Height − (intercept + 0.64 x Weight))²
d SSR / d intercept = -2 x (Height − (intercept + 0.64 x Weight))

However, we have three observations in the dataset, so the SSR and its derivative both have three terms:

SSR = (Height₁ − (intercept + 0.64 x Weight₁))²
    + (Height₂ − (intercept + 0.64 x Weight₂))²
    + (Height₃ − (intercept + 0.64 x Weight₃))²

d SSR / d intercept = -2 x (Height₁ − (intercept + 0.64 x Weight₁))
                    + -2 x (Height₂ − (intercept + 0.64 x Weight₂))
                    + -2 x (Height₃ − (intercept + 0.64 x Weight₃))
Terminology Alert!!! Parameters

OH NO!!! MORE TERMINOLOGY!!! In machine learning lingo, the values we optimize with Gradient Descent, like the intercept and the slope, are called Parameters. BAM!!!
Gradient Descent for One Parameter: Step-by-Step

1 First, plug the Observed values into the derivative of the Loss or Cost Function. In this example, the SSR is the Loss or Cost Function… …so that means plugging the Observed Weight and Height measurements into the derivative of the SSR:

d SSR / d intercept = -2 x (Height₁ − (intercept + 0.64 x Weight₁))
                    + -2 x (Height₂ − (intercept + 0.64 x Weight₂))
                    + -2 x (Height₃ − (intercept + 0.64 x Weight₃))

3 Now evaluate the derivative at the current value for the intercept. In this case, the current value is 0. When we do the math, we get -5.7… …thus, when the intercept = 0, the slope of this tangent line is -5.7.
7 After 7 iterations… BAM??? Not yet! Now let's see how well Gradient Descent optimizes the intercept and the slope!
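To make these steps concrete, here's a minimal sketch of this loop in Python. The three (Weight, Height) points, the fixed slope of 0.64, and the initial intercept of 0 come from the example; the Learning Rate of 0.1, which scales the derivative into a Step Size, is an assumption.

```python
# A minimal sketch of Gradient Descent for one parameter (the intercept),
# with the slope fixed at 0.64; the learning rate of 0.1 is an assumption.
import numpy as np

weight = np.array([0.5, 2.3, 2.9])
height = np.array([1.4, 1.9, 3.2])
slope, intercept, learning_rate = 0.64, 0.0, 0.1

for step in range(7):
    # derivative of the SSR with respect to the intercept (-5.7 when intercept = 0)
    d_intercept = np.sum(-2 * (height - (intercept + slope * weight)))
    step_size = learning_rate * d_intercept
    intercept = intercept - step_size
    print(f"iteration {step + 1}: intercept = {intercept:.3f}")
# after 7 iterations, the intercept settles near 0.95
```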
Optimizing Two or More Parameters: Details

1 Now that we know how to optimize the intercept of the line that minimizes the SSR, let's optimize both the intercept and the slope.

2 When we optimize two parameters, we get a 3-dimensional graph of the SSR: one axis represents different values for the slope… …another axis represents different values for the intercept… …and the vertical axis is for the SSR.

3 Just like before, the goal is to find the parameter values that give us the lowest SSR. And just like before, Gradient Descent initializes the parameters with random values and then uses derivatives to update those parameters, one step at a time, until they're optimal.
Taking Multiple (Partial) Derivatives of the SSR: Part 1

1 The good news is that taking the derivative of the SSR with respect to the intercept is exactly the same as before. We can use The Chain Rule to tell us how the SSR changes with respect to the intercept.

Step 2: Because the Residual links the intercept to the SSR, The Chain Rule tells us that the derivative of the SSR with respect to the intercept is:

d SSR / d intercept = (d SSR / d Residual) x (d Residual / d intercept)

Step 3: Use The Power Rule to solve for the two derivatives:

d SSR / d Residual = d (Residual)² / d Residual = 2 x Residual

d Residual / d intercept = d (Height − (intercept + slope x Weight)) / d intercept
= d (Height − intercept − slope x Weight) / d intercept = 0 − 1 − 0 = -1

Because of the subtraction, we can remove the parentheses by multiplying everything inside by -1. Then, because the first and last terms do not include the intercept, their derivatives, with respect to the intercept, are both 0. However, the second term is the negative intercept, so its derivative is -1.

Step 4: Plug the derivatives into The Chain Rule to get the final derivative of the SSR with respect to the intercept:

d SSR / d intercept = (d SSR / d Residual) x (d Residual / d intercept) = 2 x Residual x -1
= 2 x (Height − (intercept + slope x Weight)) x -1
= -2 x (Height − (intercept + slope x Weight))

(We multiply the -1 on the right by the 2 on the left to get -2.) BAM!!!
Taking Multiple (Partial) Derivatives of the SSR: Part 2

2 The other good news is that taking the derivative of the SSR with respect to the slope is very similar to what we just did for the intercept. We can use The Chain Rule to tell us how the SSR changes with respect to the slope.

Step 2: Because the Residual links the slope to the SSR, The Chain Rule tells us that the derivative of the SSR with respect to the slope is:

d SSR / d slope = (d SSR / d Residual) x (d Residual / d slope)

Step 3: Use The Power Rule to solve for the two derivatives:

d SSR / d Residual = d (Residual)² / d Residual = 2 x Residual

d Residual / d slope = d (Height − (intercept + slope x Weight)) / d slope
= d (Height − intercept − slope x Weight) / d slope = 0 − 0 − Weight = -Weight

Because of the subtraction, we can remove the parentheses by multiplying everything inside by -1. Then, because the first and second terms do not include the slope, their derivatives, with respect to the slope, are both 0. However, the last term is the negative slope times Weight, so its derivative is -Weight.

Step 4: Plug the derivatives into The Chain Rule to get the final derivative of the SSR with respect to the slope:

d SSR / d slope = (d SSR / d Residual) x (d Residual / d slope) = 2 x Residual x -Weight
= 2 x (Height − (intercept + slope x Weight)) x -Weight
= -2 x Weight x (Height − (intercept + slope x Weight))

(We multiply the -Weight on the right by the 2 on the left to get -2 x Weight.) DOUBLE BAM!!!
Gradient Descent for Two Parameters: Step-by-Step

1 First, plug the Observed values into the two derivatives of the SSR: one with respect to the intercept… …and one with respect to the slope.

Gentle Reminder: The Weight and Height values that we're plugging into the derivatives come from the raw data in the graph: (Weight, Height) = (0.5, 1.4), (2.3, 1.9), and (2.9, 3.2).

d SSR / d intercept = -2 x (3.2 − (intercept + slope x 2.9))
                    + -2 x (1.9 − (intercept + slope x 2.3))
                    + -2 x (1.4 − (intercept + slope x 0.5))

d SSR / d slope = -2 x Weight₁ x (Height₁ − (intercept + slope x Weight₁))
                + -2 x Weight₂ x (Height₂ − (intercept + slope x Weight₂))
                + -2 x Weight₃ x (Height₃ − (intercept + slope x Weight₃))

                = -2 x 2.9 x (3.2 − (intercept + slope x 2.9))
                + -2 x 2.3 x (1.9 − (intercept + slope x 2.3))
                + -2 x 0.5 x (1.4 − (intercept + slope x 0.5))
Gradient Descent for Two Parameters: Step-by-Step
2. Now initialize the parameter, or parameters, that we want to optimize with random values. In this example, we'll set the intercept to 0 and the slope to 0.5, so the candidate line is:

    Height = intercept + slope x Weight

The two derivatives we just derived are:

    d SSR / d intercept = -2 x ( 3.2 - (intercept + slope x 2.9) )
                        + -2 x ( 1.9 - (intercept + slope x 2.3) )
                        + -2 x ( 1.4 - (intercept + slope x 0.5) )

    d SSR / d slope = -2 x 2.9 x ( 3.2 - (intercept + slope x 2.9) )
                    + -2 x 2.3 x ( 1.9 - (intercept + slope x 2.3) )
                    + -2 x 0.5 x ( 1.4 - (intercept + slope x 0.5) )

…and plugging the current values, intercept = 0 and slope = 0.5, into the derivative with respect to the slope gives us:

    d SSR / d slope = -2 x 2.9 x ( 3.2 - (0 + 0.5 x 2.9) )
                    + -2 x 2.3 x ( 1.9 - (0 + 0.5 x 2.3) )
                    + -2 x 0.5 x ( 1.4 - (0 + 0.5 x 0.5) )
Gradient Descent for Two Parameters: Step-by-Step
3. Evaluate the derivatives at the current values for the intercept, 0, and slope, 0.5. Doing the arithmetic gives us -7.3 for the derivative with respect to the intercept and -14.75 for the derivative with respect to the slope.
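Here's a minimal Python sketch (not from the book) that puts all of the steps together for the three (Weight, Height) points above; the Learning Rate of 0.01 and the stopping rule are assumptions:

    weights = [0.5, 2.3, 2.9]
    heights = [1.4, 1.9, 3.2]

    intercept, slope = 0.0, 0.5   # the initial values from step 2
    learning_rate = 0.01

    for step in range(1000):
        # The two partial derivatives of the SSR, summed over all data points
        d_intercept = sum(-2 * (h - (intercept + slope * w))
                          for w, h in zip(weights, heights))
        d_slope = sum(-2 * w * (h - (intercept + slope * w))
                      for w, h in zip(weights, heights))

        # Step Size = derivative x Learning Rate; new value = old value - Step Size
        step_i = learning_rate * d_intercept
        step_s = learning_rate * d_slope
        intercept -= step_i
        slope -= step_s

        # Stop when both Step Sizes are super small
        if max(abs(step_i), abs(step_s)) < 1e-6:
            break

    print(round(intercept, 2), round(slope, 2))  # roughly 0.95 and 0.64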
Gradient Descent for Two Parameters: Step-by-Step
TRIPLE BAM!!!

[Figure: the 3-D graph of the SSR, where this axis represents different values for the slope, next to the optimized line on the Weight vs. Height data and the derivative d SSR / d slope = -2 x Weight x ( Height - (intercept + slope x Weight) ).]

Gradient Descent is awesome, but when we have a lot of data or a lot of parameters, it can be slow. Is there any way to make it faster?
Stochastic Gradient Descent: Details Part 2
7. Then we just repeat the last 4 steps until the Step Sizes are super small, which suggests we've optimized the parameters, or until we reach a maximum number of steps: a) pick a random point from the dataset… b) evaluate the derivatives at their current values… c) calculate the Step Sizes… and d) calculate the new values. BAM!
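And here's a minimal Python sketch (not from the book) of the Stochastic Gradient Descent loop, where each step uses one randomly picked point instead of the whole dataset; the data and Learning Rate are the same assumptions as before:

    import random

    weights = [0.5, 2.3, 2.9]
    heights = [1.4, 1.9, 3.2]

    intercept, slope = 0.0, 0.5
    learning_rate = 0.01

    for step in range(10000):
        w, h = random.choice(list(zip(weights, heights)))  # a) pick a random point
        residual = h - (intercept + slope * w)
        d_intercept = -2 * residual                 # b) evaluate the derivatives...
        d_slope = -2 * w * residual                 #    ...at this one point only
        intercept -= learning_rate * d_intercept    # c) and d) Step Sizes and
        slope -= learning_rate * d_slope            #    new parameter values

    print(round(intercept, 2), round(slope, 2))  # noisier, but close to 0.95 and 0.64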
Gradient Descent: FAQ
Will Gradient Descent always find the best parameter values?

Unfortunately, Gradient Descent does not always find the best parameter values. For example, if the graph of the SSR looked like this…

[Figure: an SSR curve plotted against the parameter of interest, with a local minimum and a deeper global minimum.]

…then it's possible that we might get stuck at the bottom of this local minimum…instead of finding our way to the bottom and the global minimum.

When this happens (when we get stuck in a local minimum instead of finding the global minimum), it's a bummer. Even worse, usually it's not possible to graph the SSR, so we might not even know we're in one, of potentially many, local minimums. However, there are a few things we can do about it. We can:

1) Try again using different random numbers to initialize the parameters that we want to optimize. Starting with different values may avoid a local minimum.

2) Fiddle around with the Step Size. Making it a little larger may help avoid getting stuck in a local minimum.

3) Use Stochastic Gradient Descent, because the extra randomness helps avoid getting trapped in a local minimum.
Chapter 06
Logistic Regression!!!
Logistic Regression: Main Ideas Part 1
1. The goal is to make a classifier that uses the amount of Popcorn someone eats to classify whether or not they love Troll 2.

2. A Solution: Logistic Regression, which probably should have been named Logistic Classification since it's used to classify things, fits a squiggle to data that tells us the predicted probability (between 0 and 1) for discrete variables, like whether or not someone loves Troll 2. Like Linear Regression, Logistic Regression has metrics that are similar to R² to give us a sense of how accurate our predictions will be, and it also calculates p-values.

[Figure: the Logistic Regression squiggle, with Popcorn (g) on the x-axis and the probability that someone loves Troll 2 on the y-axis, running from 0 = Does Not Love Troll 2 to 1 = Loves Troll 2.]

4. If someone new comes along and tells us that they ate this much Popcorn…then the squiggle tells us that there's a relatively high probability that that person will love Troll 2. Specifically, the corresponding y-axis value on the squiggle tells us that the probability that this person loves Troll 2 is 0.96. TRIPLE BAM!!!

In a few pages, we'll talk about how we fit a squiggle to Training Data. However, before we do that, we need to learn some fancy terminology.
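If you'd like to see the squiggle in action, here's a minimal Python sketch (not from the book) that fits Logistic Regression with scikit-learn; the Popcorn amounts and labels are made up for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    popcorn = np.array([[2.5], [8.0], [12.0], [35.0], [40.0], [55.0]])  # grams, made up
    loves_troll2 = np.array([0, 0, 0, 1, 1, 1])  # 0 = Does Not Love, 1 = Loves

    model = LogisticRegression().fit(popcorn, loves_troll2)

    # The squiggle's y-axis value for a new person who ate 30 grams of Popcorn:
    print(model.predict_proba([[30.0]])[0, 1])  # the probability that they love Troll 2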
Terminology Alert!!! Probability vs. Likelihood: Part 1
In this specific example, the y-axis represents the Likelihood of observing any specific Height, and the Normal Distribution's maximum likelihood value occurs at its mean.

[Figure: a Normal Distribution over Heights, from Shorter to Taller, with the y-axis running from Less Likely to More Likely and the peak at the Average Height.]
Terminology Alert!!! Probability vs. Likelihood: Part 2
Terminology Alert!!! Probability vs. Likelihood: Part 3
Likelihoods are often used to evaluate how well a statistical distribution fits data. For example, if we had a bunch of Height measurements and we wanted to compare the fit of one Normal curve…to the fit of a Normal curve with different parameters…we could calculate and compare the Likelihoods of the data under each curve.

[Figure: two candidate Normal curves plotted over Heights from 142.5 cm to 168.9 cm, each with Likelihood on the y-axis.]

[Figure: the Logistic Regression squiggle with the Training Data marked as X's, the probability that someone loves Troll 2 on the y-axis, and a legend for Loves Troll 2 and Does Not Love Troll 2.]
Fitting A Squiggle To Data: Main Ideas Part 2
5. In contrast, calculating the Likelihoods is different for the people who Do Not Love Troll 2 because the y-axis is the probability that they Love Troll 2. The good news is, because someone either loves Troll 2 or they don't, the probability that someone does not love Troll 2 is just 1 minus the probability that they love Troll 2:

    p( Does Not Love Troll 2 ) = 1 - p( Loves Troll 2 )

Bam!

[Figure: two copies of the squiggle, probability that someone loves Troll 2 vs. Popcorn (g), showing that the Likelihood for someone who Does Not Love Troll 2 is 1 minus the squiggle's y-axis value.]
Fitting A Squiggle To Data: Main Ideas Part 3
7. Now that we know how to calculate the Likelihoods for people who Love Troll 2 and people who Do Not Love Troll 2, we can calculate the Likelihood for the entire squiggle by multiplying the individual Likelihoods together…and when we do the math, we get 0.02.

8. Now we calculate the Likelihood for a different squiggle…

9. …and compare the total Likelihoods for both squiggles. The goal is to find the squiggle with the Maximum Likelihood.

[Figure: two candidate squiggles, each plotted with the Training Data from 0 = Does Not Love Troll 2 to 1 = Loves Troll 2 against Popcorn (g).]

NOTE: In practice, we have to watch out for Underflow, which happens when you try to multiply a lot of small numbers between 0 and 1. Technically, Underflow happens when a mathematical operation, like multiplication, results in a number that's smaller than the computer is capable of storing. Underflow can result in errors, which are bad, or it can result in weird, unpredictable results, which are worse.

One way to avoid Underflow errors is to just take the log (usually the natural log, or log base e), which turns the multiplication into addition…

    log(0.4 x 0.6 x 0.8 x 0.9 x 0.9 x 0.9 x 0.9 x 0.7 x 0.2)
    = log(0.4) + log(0.6) + log(0.8) + log(0.9) + log(0.9)
      + log(0.9) + log(0.9) + log(0.7) + log(0.2)
    = -4.0

…and, ultimately, turns a number that was relatively close to 0, like 0.02, into a number relatively far from 0, -4.0.
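Here's a minimal Python sketch (not from the book) of how we could compare two candidate squiggles with log Likelihoods; each squiggle's y-axis values are made up:

    import math

    def log_likelihood(probs, loves):
        """probs: the squiggle's p(loves Troll 2) for each person; loves: 1 or 0."""
        total = 0.0
        for p, y in zip(probs, loves):
            # People who love Troll 2 contribute log(p); everyone else
            # contributes log(1 - p), just like on the previous pages.
            total += math.log(p) if y == 1 else math.log(1 - p)
        return total

    loves = [0, 0, 0, 1, 1, 1]
    squiggle_a = [0.10, 0.20, 0.40, 0.60, 0.80, 0.95]  # made-up y-axis values
    squiggle_b = [0.30, 0.40, 0.50, 0.55, 0.60, 0.70]

    # The squiggle with the larger (less negative) total wins.
    print(log_likelihood(squiggle_a, loves), log_likelihood(squiggle_b, loves))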
Logistic Regression: Weaknesses
1. When we use Logistic Regression, we assume that an s-shaped squiggle (or, if we have more than one independent variable, an s-shaped surface) will fit the data well. In other words, we assume that there's a relatively straightforward relationship between the Popcorn and Loves Troll 2 variables: if someone eats very little Popcorn, then there's a relatively low probability that they love Troll 2, and if someone eats a lot of Popcorn, then there's a relatively high probability that they love Troll 2.

2. In contrast, if our data had people who ate a little or a lot of Popcorn who do not love Troll 2…and people who ate an intermediate amount who do, then we would violate the assumption that Logistic Regression makes about the data.

[Figure: two panels of probability that someone loves Troll 2 vs. Popcorn (g), one where an s-shape fits and one where it does not.]
Chapter 07
Naive Bayes!!!
Naive Bayes: Main Ideas
A Solution: Naive Bayes compares two scores.

5. We take the Prior Probability, an initial guess, that the message is Normal, multiplied by the probabilities of seeing the words Dear and Friend, given that the message is Normal…and then we compare that to the Prior Probability, another initial guess, that the message is Spam…multiplied by the probabilities of seeing the words Dear and Friend, given that the message is Spam:

    p( N ) x p( Dear | N ) x p( Friend | N )
    vs.
    p( S ) x p( Dear | S ) x p( Friend | S )

6. Whichever equation gives us the larger value, which we'll call a score, tells us the final classification.
Multinomial Naive Bayes: Details Part 1
1. There are several types of Naive Bayes algorithms, but the most commonly used version is called Multinomial Naive Bayes.

2. We start with Training Data: 8 messages that we know are Normal…and 4 messages that we know are Spam.

For example, the probability that we see the word Dear given that we see it in Spam (S) messages is:

    p( Dear | S ) = 2 / 7 = 0.29

…and likewise, p( Lunch | S ) = 0.00 and p( Money | S ) = 0.57.

8. For example, because 8 of the 12 messages are Normal, we let the Prior Probability for Normal messages, p(N), be 8/12 = 0.67…

    p( N ) = # of Normal Messages / Total # of Messages = 8 / 12 = 0.67

9. …and because 4 of the 12 messages are Spam, we let the Prior Probability for Spam messages, p(S), be 4/12 = 0.33.

    p( S ) = # of Spam Messages / Total # of Messages = 4 / 12 = 0.33
Multinomial Naive Bayes: Details Part 3
10. Now that we have the Prior Probability for Normal messages, p( N ) = 0.67, and the probabilities for each word occurring in Normal messages…

    p( Dear | N ) = 0.47
    p( Friend | N ) = 0.29
    p( Lunch | N ) = 0.18
    p( Money | N ) = 0.06

…we can calculate the overall score that the message, Dear Friend, is Normal…

11. …by multiplying the Prior Probability that the message is Normal by the probabilities of seeing the words Dear and Friend in Normal messages…and when we do the math, we get 0.09:

    p( N ) x p( Dear | N ) x p( Friend | N ) = 0.67 x 0.47 x 0.29 = 0.09

16. For example, if we see this message: Money Money Money Money Lunch…
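Here's a minimal Python sketch (not from the book) of the scoring we just did; p( Friend | S ) = 1/7 = 0.14 is deduced from the Spam word counts above rather than shown in the text:

    p_normal, p_spam = 0.67, 0.33
    p_word_normal = {'Dear': 0.47, 'Friend': 0.29, 'Lunch': 0.18, 'Money': 0.06}
    p_word_spam = {'Dear': 0.29, 'Friend': 0.14, 'Lunch': 0.00, 'Money': 0.57}

    def score(message, prior, word_probs):
        result = prior
        for word in message.split():
            result *= word_probs[word]  # multiply in each word's probability
        return result

    normal = score('Dear Friend', p_normal, p_word_normal)  # 0.67 x 0.47 x 0.29 = 0.09
    spam = score('Dear Friend', p_spam, p_word_spam)
    print('Normal' if normal > spam else 'Spam')  # prints Normal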
Gaussian Naive Bayes: Details Part 1
1. First we create Gaussian (Normal) curves for each Feature (each column in the Training Data that we're using to make predictions).

    Popcorn (grams) | Soda Pop (ml) | Candy (grams)
    24.3            | 750.7         | 0.2
    28.2            | 533.2         | 50.5
    etc.            | etc.          | etc.

Starting with Popcorn, for the people who Do Not Love Troll 2, we calculate their mean, 4, and standard deviation (sd), 2, and then use those values to draw a Gaussian curve. Then we draw a Gaussian curve for the people who Love Troll 2 using their mean, 24, and their standard deviation, 4.

    p( Loves Troll 2 ) = # of People Who Love Troll 2 / Total # of People = 4 / ( 4 + 3 ) = 0.6

6. NOTE: When we plug in the actual numbers…

    p( Loves Troll 2 )               0.6
    x L( Popcorn = 20 | Loves )      x 0.06
    x L( Soda Pop = 500 | Loves )    x 0.004
    x L( Candy = 100 | Loves )       x 0.000000000…001

…we end up plugging in a tiny number for the Likelihood of Candy, because the y-axis coordinate is super close to 0…and computers can have trouble doing multiplication with numbers very close to 0. This problem is called Underflow.

9. Lastly, because the score for Does Not Love Troll 2 (-48) is greater than the score for Loves Troll 2 (-124)…

    Log( Loves Troll 2 Score ) = -124
    Log( Does Not Love Troll 2 Score ) = -48

…we classify this person as someone who Does Not Love Troll 2. DOUBLE BAM!!!

Now that we understand Multinomial and Gaussian Naive Bayes, let's answer some Frequently Asked Questions.
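Here's a minimal Python sketch (not from the book) of Gaussian Naive Bayes scoring with logs to avoid Underflow; the Popcorn means and standard deviations come from the example above, but the Soda Pop and Candy parameters are placeholders:

    import math

    def log_gaussian(x, mean, sd):
        """The natural log of the Gaussian curve's y-axis value (the Likelihood)."""
        return -math.log(sd * math.sqrt(2 * math.pi)) - (x - mean) ** 2 / (2 * sd ** 2)

    def log_score(prior, values, params):
        # The log turns the big multiplication into a sum, so nothing underflows
        return math.log(prior) + sum(
            log_gaussian(x, m, s) for x, (m, s) in zip(values, params))

    new_person = [20, 500, 100]  # Popcorn (g), Soda Pop (ml), Candy (g)

    loves = log_score(0.6, new_person, [(24, 4), (500, 100), (25, 5)])
    does_not = log_score(0.4, new_person, [(4, 2), (600, 100), (100, 5)])

    print('Loves Troll 2' if loves > does_not else 'Does Not Love Troll 2')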
Naive Bayes: FAQ Part 1
If my continuous data are not Gaussian, can I still use Gaussian Naive Bayes?
Silly Bam. :)
Naive Bayes: FAQ Part 2

What do you call it when we mix Naive Bayes techniques together?

I don't know, Norm. Maybe "Naive Bayes Deluxe!"

What if some of my data are discrete and some of my data are continuous? Can I still use Naive Bayes?

Yes! We can combine the histogram approach from Multinomial Naive Bayes with the distribution approach from Gaussian Naive Bayes. For example, if we had word counts from Normal messages and Spam, we could use them to create histograms and probabilities…and combine those with Likelihoods from Exponential distributions that represent the amount of time that elapses between receiving Normal messages and Spam…

    p( N | Dear Friend ) = p( N ) x p( Dear | N ) x p( Friend | N )
        / ( p( N ) x p( Dear | N ) x p( Friend | N ) + p( S ) x p( Dear | S ) x p( Friend | S ) )

    p( S | Dear Friend ) = p( S ) x p( Dear | S ) x p( Friend | S )
        / ( p( N ) x p( Dear | N ) x p( Friend | N ) + p( S ) x p( Dear | S ) x p( Friend | S ) )
Chapter 08
Assessing Model
Performance!!!
Assessing Model Performance: Main Ideas
1. The Problem: We've got this dataset, and we want to use Chest Pain, Good Blood Circulation, Blocked Arteries, and Weight to predict Heart Disease, but we don't yet know which model will make the best predictions.

    Chest Pain | Good Blood Circ. | Blocked Arteries | Weight | Heart Disease
    No         | No               | No               | 125    | No
    Yes        | Yes              | Yes              | 180    | Yes
    Yes        | Yes              | No               | 210    | No
    …          | …                | …                | …      | …

2. A Solution: Rather than worry too much upfront about which model will be best, we can try both and assess their performance using a wide variety of tools…from Confusion Matrices, simple grids that tell us what each model did right and what it did wrong…to Receiver Operator Curves (ROCs), which give us an easy way to evaluate how each model performs with different classification thresholds. BAM!!!

3. …and we use the first block of data to optimize both models using Cross Validation.

4. Then we apply the optimized Naive Bayes model to the remaining data and count what it got right and what it got wrong, including 2) the number of people with Heart Disease who were incorrectly predicted to not have Heart Disease, and 3) the number of people without Heart Disease who were incorrectly predicted to have it. We then build a Confusion Matrix for the optimized Logistic Regression model…

                      Predicted Has HD | Predicted Does Not Have HD
    Actual Has HD     142              | 22
    Actual Does Not   29               | 115

…and, at a glance, we can see that Logistic Regression is better at predicting people without Heart Disease and Naive Bayes is better at predicting people with it.
The Confusion Matrix: Details Part 2
2. When there are only two possible outcomes, like Yes and No…then the corresponding Confusion Matrix has 2 rows and 2 columns: one each for Yes and No.

                      Predicted Has HD | Predicted Does Not Have HD
    Actual Has HD     142              | 22
    Actual Does Not   29               | 110

3. When there are 3 possible outcomes, like in this dataset that has 3 choices for favorite movie, Troll 2, Gore Police, and Cool as Ice (predicted from columns like Run for Your Wife, Jurassic Park III, Howard the Duck, and Out Kold)…then the corresponding Confusion Matrix has 3 rows and 3 columns.

4. In general, the size of the matrix corresponds to the number of classifications we want to predict.
The Confusion Matrix: Details Part 3
5. Unfortunately, there's no standard for how a Confusion Matrix is oriented. In many cases, the rows reflect the actual, or known, classifications and the columns represent the predictions…

                      Predicted Yes     | Predicted No
    Actual Yes        True Positives    | False Negatives
    Actual No         False Positives   | True Negatives

…but in other cases, the rows reflect the predictions and the columns represent the actual, or known, classifications:

                      Actual Yes        | Actual No
    Predicted Yes     True Positives    | False Positives
    Predicted No      False Negatives   | True Negatives

So, make sure you read the labels before you interpret a Confusion Matrix!
The Confusion Matrix: Details Part 4

Gentle Reminder:
                      Predicted Yes          | Predicted No
    Actual Yes        True Positive (TP)     | False Negative (FN)
    Actual No         False Positive (FP)    | True Negative (TN)

6. Lastly, in the original example, when we compared the Confusion Matrices for Naive Bayes and Logistic Regression…because both matrices contained the same number of False Negatives (22 each) and False Positives (29 each)…all we needed to do to quantify how much better Logistic Regression was at predicting people without Heart Disease was compare the True Negatives (110 vs. 115).

7. However, what if we had ended up with different numbers of False Negatives and False Positives? We can still see that Naive Bayes is better at predicting people with Heart Disease, but quantifying how much better is a little more complicated because now we have to compare True Positives (142 vs. 139) and False Negatives (29 vs. 32). Likewise, quantifying how much better Logistic Regression is at predicting people without Heart Disease has to take True Negatives (110 vs. 112) and False Positives (22 vs. 20) into account.

Sensitivity and Specificity: Main Ideas

1. To quantify how well a model correctly classifies the actual Positives, in this case, the known people with Heart Disease, we calculate Sensitivity, which is the percentage of the actual Positives that were correctly classified:

    Sensitivity = True Positives / ( True Positives + False Negatives )

For example, using the Heart Disease data and this Confusion Matrix for Naive Bayes…

                      Predicted Yes | Predicted No
    Actual Yes        142           | 29
    Actual No         22            | 110

…the Sensitivity for Naive Bayes is…

    Sensitivity = TP / ( TP + FN ) = 142 / ( 142 + 29 ) = 0.83

…which means that 83% of the people with Heart Disease were correctly classified. BAM!!!

2. To quantify how well a model correctly classifies the actual Negatives, we calculate Specificity, the percentage of the actual Negatives that were correctly classified:

    Specificity = True Negatives / ( True Negatives + False Positives )

For example, using the Heart Disease data and Confusion Matrix, the Specificity for Logistic Regression is…

    Specificity = TN / ( TN + FP ) = 115 / ( 115 + 20 ) = 0.85

…which means that 85% of the people without Heart Disease were correctly classified. DOUBLE BAM!!! Now let's talk about Precision and Recall.
Precision and Recall: Main Ideas

1. Precision is another metric that can summarize a Confusion Matrix. It tells us the percentage of the predicted Positive results (so, both True and False Positives) that were correctly classified:

    Precision = True Positives / ( True Positives + False Positives )

For example, using the Heart Disease data and the Confusion Matrix for Naive Bayes…

                      Predicted Yes | Predicted No
    Actual Yes        142           | 29
    Actual No         22            | 110

…the Precision for Naive Bayes is…

    Precision = TP / ( TP + FP ) = 142 / ( 142 + 22 ) = 0.87

…which means that of the 164 people that we predicted to have Heart Disease, 87% actually have it. In other words, Precision gives us a sense of the quality of the positive results. When we have high Precision, we have high-quality positive results.
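Here's a minimal Python sketch (not from the book) that computes all three metrics from a 2x2 Confusion Matrix laid out as [ [TP, FN], [FP, TN] ]:

    def metrics(matrix):
        (tp, fn), (fp, tn) = matrix
        return {
            'Sensitivity (Recall)': tp / (tp + fn),
            'Specificity': tn / (tn + fp),
            'Precision': tp / (tp + fp),
        }

    naive_bayes = [[142, 29], [22, 110]]
    for name, value in metrics(naive_bayes).items():
        print(f'{name}: {value:.2f}')  # Sensitivity 0.83, Specificity 0.83, Precision 0.87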
The Sensitivity, Specificity, Precision, Recall Song!!!

Scan, click, or tap this QR code to hear the Silly Song!!!

[Sheet music in D Major: "Precision is something different / It's the percentage of predicted positives / correctly predicted / and recall gets us back to the start / it's the same as… Now go back to the top and repeat!!!"]

NOTE: Recall is just another name for Sensitivity, the True Positive Rate:

    True Positive Rate = Recall = Sensitivity = True Positives / ( True Positives + False Negatives )

NOTE: Remember, Specificity is the proportion of actual Negatives that were correctly classified, thus…

    False Positive Rate = 1 - Specificity

…and…

    Specificity = 1 - False Positive Rate

BAM!!!
ROC: Main Ideas Part 1
1. In Chapter 6, we used Logistic Regression to predict whether or not someone loves Troll 2, and we mentioned that usually the classification threshold is 50%…
ROC: Main Ideas Part 2
3. Now let's talk about what happens when we use a different classification threshold for deciding if someone Loves Troll 2 or not. For example, if it was super important to correctly classify every single person who Loves Troll 2, we could set the threshold to 0.01.

[Figure: the squiggle, from 0 = Does Not Love Troll 2 to 1 = Loves Troll 2, with the classification threshold drawn at 0.01.]

4. With the threshold at 0.01, we correctly classify all 5 people who Love Troll 2…but we also increase the number of people who are incorrectly classified as loving Troll 2.
ROC: Main Ideas Part 3
5. On the other hand, if it was super important to correctly classify everyone who Does Not Love Troll 2, we could set the classification threshold to 0.95…

[Figure: the squiggle, probability that someone loves Troll 2 vs. Popcorn (g), with the classification threshold drawn at 0.95.]

…and that would give us a Confusion Matrix where everyone who Does Not Love Troll 2 is correctly classified: 0 False Positives and 4 True Negatives.
ROC: Main Ideas Part 4
7. We could also set the classification threshold to 0, and classify everyone as someone who Loves Troll 2…and that would give us this Confusion Matrix:

                      Predicted Yes | Predicted No
    Actual Yes        5             | 0
    Actual No         4             | 0

8. Or we could go the other way and classify everyone as someone who Does Not Love Troll 2, and get this Confusion Matrix:

                      Predicted Yes | Predicted No
    Actual Yes        0             | 5
    Actual No         0             | 4
ROC: Main Ideas Part 5
9. Ultimately, we can try any classification threshold from 0 to 1…and when we do, we end up with 10 different Confusion Matrices that we can choose from. (NOTE: The threshold under each Confusion Matrix is just one of many that will result in the exact same Confusion Matrix.) Writing each matrix as [ [TP, FN], [FP, TN] ], we get:

    Threshold = 0:      [ [5, 0], [4, 0] ]
    Threshold = 0.007:  [ [5, 0], [3, 1] ]
    Threshold = 0.0645: [ [5, 0], [2, 2] ]
    Threshold = 0.195:  [ [5, 0], [1, 3] ]
    Threshold = 0.5:    [ [4, 1], [1, 3] ]
    Threshold = 0.87:   [ [3, 2], [1, 3] ]
    Threshold = 0.95:   [ [3, 2], [0, 4] ]
    Threshold = 0.965:  [ [2, 3], [0, 4] ]
    Threshold = 0.975:  [ [1, 4], [0, 4] ]
    Threshold = 1:      [ [0, 5], [0, 4] ]

10. Comparing 10 Confusion Matrices is tedious and annoying. Wouldn't it be awesome if we could consolidate them into one easy-to-interpret graph? YES!!! That graph is an ROC graph, which plots the True Positive Rate (or Sensitivity) on the y-axis against the False Positive Rate (or 1 - Specificity) on the x-axis…and the further to the left along the x-axis, the lower the percentage of actual Negatives that were incorrectly classified.

11. Now that we understand the main ideas behind ROC graphs, let's dive into the details!!!
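Here's a minimal Python sketch (not from the book) that turns the 10 Confusion Matrices above into the points of the ROC graph:

    matrices = {
        0.0:    [[5, 0], [4, 0]],
        0.007:  [[5, 0], [3, 1]],
        0.0645: [[5, 0], [2, 2]],
        0.195:  [[5, 0], [1, 3]],
        0.5:    [[4, 1], [1, 3]],
        0.87:   [[3, 2], [1, 3]],
        0.95:   [[3, 2], [0, 4]],
        0.965:  [[2, 3], [0, 4]],
        0.975:  [[1, 4], [0, 4]],
        1.0:    [[0, 5], [0, 4]],
    }

    for threshold, ((tp, fn), (fp, tn)) in matrices.items():
        tpr = tp / (tp + fn)  # y-axis: the True Positive Rate (Sensitivity or Recall)
        fpr = fp / (fp + tn)  # x-axis: the False Positive Rate (1 - Specificity)
        print(f'threshold {threshold}: TPR = {tpr:.2f}, FPR = {fpr:.2f}')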
ROC: Details Part 1

1. To get a better sense of how an ROC graph works, let's draw one from start to finish. We'll start by using a classification threshold, 1, that classifies everyone as someone who Does Not Love Troll 2…and when the classification threshold is set to 1, we end up with this Confusion Matrix:

                      Predicted Yes | Predicted No
    Actual Yes        0             | 5
    Actual No         0             | 4
    Threshold = 1

2. Then we calculate the True Positive Rate and the False Positive Rate…

    True Positive Rate = True Positives / ( True Positives + False Negatives ) = 0 / ( 0 + 5 ) = 0

    False Positive Rate = False Positives / ( False Positives + True Negatives ) = 0 / ( 0 + 4 ) = 0

…and plot the point (0, 0) on the ROC graph, which has the True Positive Rate (or Sensitivity) on the y-axis and the False Positive Rate (or 1 - Specificity) on the x-axis.
ROC: Details Part 2

5. Now let's lower the classification threshold to 0.975, which is just enough to classify one person as someone who Loves Troll 2…and everyone else is classified as someone who Does Not Love Troll 2…and that gives us this Confusion Matrix:

                      Predicted Yes | Predicted No
    Actual Yes        1             | 4
    Actual No         0             | 4
    Threshold = 0.975

6. Then we calculate the True Positive Rate…

    True Positive Rate = 1 / ( 1 + 4 ) = 0.2

7. …and the False Positive Rate…

    False Positive Rate = 0 / ( 0 + 4 ) = 0

…and the new point, (0, 0.2), is above the first point, showing that the new threshold increases the proportion of actual Positives that were correctly classified. BAM!!!
ROC: Details Part 3

9. Now let's lower the classification threshold to 0.965, which is just enough to classify 2 people as people who Love Troll 2…and everyone else is classified as someone who Does Not Love Troll 2…and that gives us this Confusion Matrix:

                      Predicted Yes | Predicted No
    Actual Yes        2             | 3
    Actual No         0             | 4
    Threshold = 0.965

10. Then we calculate the True Positive Rate…

    True Positive Rate = 2 / ( 2 + 3 ) = 0.4

11. …and the False Positive Rate…

    False Positive Rate = 0 / ( 0 + 4 ) = 0

…and the new point, (0, 0.4), is above the first two points, showing that the new threshold increases the proportion of actual Positives that were correctly classified.
ROC: Details Part 4
13. Likewise, for each threshold that increases the number of Positive classifications (in this example, that means classifying a person as someone who Loves Troll 2), we calculate the True Positive Rate and False Positive Rate until everyone is classified as Positive. For example, at Threshold = 0.95…

                      Predicted Yes | Predicted No
    Actual Yes        3             | 2
    Actual No         0             | 4

…we get the point (0, 0.6), and at Threshold = 0.87…

                      Predicted Yes | Predicted No
    Actual Yes        3             | 2
    Actual No         1             | 3

…we get the point (0.25, 0.6), and so on, with each point plotted using the True Positive Rate (or Sensitivity or Recall) on the y-axis and the False Positive Rate (or 1 - Specificity) on the x-axis.
ROC: Details Part 5
14. After we finish plotting the points from each possible Confusion Matrix, we usually connect the dots…and add a diagonal line that tells us when the True Positive Rate = False Positive Rate. BAM!!!

16. ROC graphs are great for selecting an optimal classification threshold for a model. But what if we want to compare how one model performs vs. another? This is where the AUC, the Area Under the Curve, comes in handy…and because Logistic Regression has a larger AUC, in this case 0.9, we can tell that, overall, Logistic Regression performs better than Naive Bayes with these data. BAM!!!

[Figure: two ROC graphs, one for Logistic Regression and one for Naive Bayes, each with the area under its curve shaded.]
AUC: FAQ

How do you calculate the AUC?

We divide the area under the curve into rectangles and triangles and add up their areas. For example, one triangle has Area = 1/2 x 0.25 x 0.2 = 0.025, and one rectangle has Area = 0.5 x 1 = 0.5…and when we add all of the pieces together, we get AUC = Total Area = 0.85.

[Figure: an ROC curve with True Positive Rate ticks at 0.2, 0.4, 0.6, and 0.8, and the area underneath divided into pieces, including pieces with areas 0.200, 0.025, and 0.025.]
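Here's a minimal Python sketch (not from the book) that adds up trapezoids under the connected dots; the points are the (False Positive Rate, True Positive Rate) pairs from the Troll 2 example above, and they give an AUC of 0.9:

    def auc(points):
        points = sorted(points)  # sort by False Positive Rate (ties by TPR)
        total = 0.0
        for (x1, y1), (x2, y2) in zip(points, points[1:]):
            total += (x2 - x1) * (y1 + y2) / 2  # the area of one trapezoid
        return total

    roc_points = [(0.0, 0.0), (0.0, 0.2), (0.0, 0.4), (0.0, 0.6), (0.25, 0.6),
                  (0.25, 0.8), (0.25, 1.0), (0.5, 1.0), (0.75, 1.0), (1.0, 1.0)]
    print(auc(roc_points))  # 0.9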
Precision Recall Graphs: Main Ideas Part 1
1. An ROC graph uses the False Positive Rate on the x-axis, and this is fine when the data are balanced, which in this case means there are similar numbers of people who Love Troll 2 and who Do Not Love Troll 2.

2. However, when the data are imbalanced, and we have way more people who Do Not Love Troll 2 (which wouldn't be surprising since it consistently wins "worst movie ever" awards)…then the ROC graph is hard to interpret because the False Positive Rate barely budges above 0 before we have a 100% True Positive Rate. For example, with only 3 people who Love Troll 2 and 200 who Do Not, we get this Confusion Matrix at Threshold = 0.5:

                      Predicted Yes | Predicted No
    Actual Yes        3             | 0
    Actual No         1             | 200

The good news is that we have Precision Recall graphs that attempt to deal with this problem. Read on for details!!!
Precision Recall Graphs: Main Ideas Part 2
3. A Precision Recall graph simply replaces the False Positive Rate on the x-axis with Precision and renames the y-axis Recall, since Recall is the same thing as the True Positive Rate. With Precision on the x-axis, good classification thresholds are closer to the right side, and now we can clearly see a bend where the classification thresholds start giving us a lot of False Positives.

[Figure: the hard-to-read ROC graph next to the corresponding Precision Recall graph, with Recall (or Sensitivity) on the y-axis and Precision on the x-axis.]
Hey Norm, now that we
understand how to summarize
how well a model performs using
Confusion Matrices and ROC
graphs, what should we learn
about next?
Chapter 09
Preventing Overfitting
with Regularization!!!
Regularization: Main Ideas
1. The Problem: The fancier and more flexible a machine learning method is, the easier it is to overfit the Training Data. In other words, this squiggle might fit the Training Data really well…but the predictions made with New Data are terrible. Using technical jargon, we would say that the squiggle has low Bias, because it fits the Training Data well, but high Variance, because it does a bad job with New Data.

[Figure: a wiggly squiggle through the Weight vs. Height Training Data, next to the same squiggle making bad predictions with New Data.]

2. A Solution: Regularization makes the model less sensitive to the Training Data…and as you can see, the new line doesn't fit the Training Data perfectly, but it does better with the Testing Data. Read on to learn how it does this.
Ridge/Squared/L2 Regularization: Details Part 2
7. When we normally fit a line to Training Data…

    Height = intercept + slope x Weight

…we want to find the y-axis intercept and the slope that minimize the SSR.

8. In contrast, when we use Ridge Regularization to optimize parameters, we simultaneously minimize the SSR and a penalty that's proportional to the square of the slope:

    SSR + λ x slope²
Ridge/Squared/L2 Regularization: Details Part 3
11. Now let's calculate the Ridge Score for the new line that doesn't fit the Training Data as well:

    Height = 1.2 + 0.6 x Weight

It has an SSR = 0.4…

12. …so we plug the SSR, 0.4, the value for λ (lambda), which for now is 1, and the slope, 0.6, into the Ridge Penalty. And when we do the math, we get 0.76:

    SSR + λ x slope² = 0.4 + 1 x 0.6² = 0.76

…and because we want to pick the line that minimizes the Ridge Score, SSR + λ x slope², we pick the new line.
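Here's a minimal Python sketch (not from the book) of the comparison above; the black line's slope, 1.3, is a placeholder, since its value isn't shown on this page:

    def ridge_score(ssr, slope, lam=1.0):
        return ssr + lam * slope ** 2

    # The black line fits the Training Data perfectly, so its SSR is 0,
    # but its slope is steep (1.3 is a placeholder value).
    print(ridge_score(ssr=0.0, slope=1.3))  # 1.69

    # The new line, Height = 1.2 + 0.6 x Weight, has SSR = 0.4.
    print(ridge_score(ssr=0.4, slope=0.6))  # 0.76, and the smaller Ridge Score wins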
Ridge/Squared/L2 Regularization: Details Part 4
15. When λ = 0, the whole Ridge Penalty is also 0…

    SSR + λ x slope² = SSR + 0 x slope² = SSR + 0 = SSR

…and that means we'll only minimize the SSR, so it's as if we're not using Ridge Regularization…and, as a result, the new line, derived from Ridge Regularization, is no different from a line that minimizes the SSR.

16. However, as we just saw, when λ = 1, we get a new line that has a smaller slope than when we only minimized the SSR.

17. When we increase λ to 2, the slope gets even smaller…

18. …and when we increase λ to 3, the slope gets even smaller…

19. …and as λ keeps growing, the slope gets closer and closer to 0 and the y-axis intercept becomes the average Height in the Training Dataset (1.8). In other words, Weight no longer plays a significant role in making predictions. Instead, we just use the mean Height. So, how do we pick a good value for λ?

[Figure: the fitted lines for λ = 0, 1, 2, and 3, each flatter than the last, on the Weight vs. Height data.]

23. Ridge Regularization also works when we have more than one parameter. For example, given…

    Height = intercept + slope_w x Weight + slope_s x Shoe Size + slope_a x Airspeed of a Swallow

…if Weight and Shoe Size are both useful for predicting Height, but Airspeed of a Swallow is not, then their slopes, slope_w and slope_s, will shrink a little bit…compared to the slope associated with the Airspeed of a Swallow, slope_a, which will shrink a lot.

https://web.stanford.edu/~hastie/TALKS/nips2005.pdf
Lasso/Absolute Value/L1 Regularization: Details Part 1
1. Lasso Regularization, also called Absolute Value or L1 Regularization, replaces the square that we use in the Ridge Penalty with the absolute value:

    SSR + λ x |slope|

2. For example, let's compare the Lasso scores for the black line, which fits the Training Data perfectly…
Lasso/Absolute Value/L1 Regularization: Details Part 2
5. The big difference between Ridge and Lasso Regularization is that Ridge Regularization can only shrink the parameters to be asymptotically close to 0. In contrast, Lasso Regularization can shrink parameters all the way to 0. For example, in…

    Height = intercept + slope_w x Weight + slope_s x Shoe Size + slope_a x Airspeed of a Swallow

…Ridge Regularization can make the useless slope_a tiny, but Lasso Regularization can set slope_a to exactly 0, removing the Airspeed of a Swallow term from the equation entirely.

DOUBLE BAM!!!
Ridge vs. Lasso Regularization: Details Part 1
1. The critical thing to know about Ridge vs. Lasso Regularization is that Ridge works better when most of the variables are useful and Lasso works better when a lot of the variables are useless, and Ridge and Lasso can be combined to get the best of both worlds. That said, people frequently ask why only Lasso can set parameter values (e.g., the slope) to 0 and Ridge cannot. What follows is an illustration of this difference, so if you're interested, read on!!!

2. As always, we'll start with a super simple dataset where we want to use Weight to predict Height.

[Figure: a Weight vs. Height scatterplot, next to an empty graph of the SSR + λ x slope² (y-axis, 0 to 20) vs. Slope Values (x-axis, -0.2 to 1.0).]
Ridge vs. Lasso Regularization: Details Part 2
3. We start by picking a slope value, drawing the corresponding line through the data, and plotting that line's SSR + λ x slope², with λ = 0, on the graph of Slope Values vs. Ridge Score.

4. Now let's increase the slope to 0.2 and plot the SSR + λ x slope², again with λ = 0…then we do the same thing for when the slope is 0.4…and plot the SSR + λ x slope², with λ = 0, for when the slope is 0.6…and plot the SSR + λ x slope², with λ = 0, for when the slope is 0.8. Ultimately, we can plot the SSR + λ x slope², with λ = 0, for a bunch of different slope values.

[Figure: for each candidate slope, the line on the Weight vs. Height data and the corresponding point on the Slope Values vs. SSR + λ x slope² graph.]
Ridge vs. Lasso Regularization: Details Part 3
5. Now that we have this blue curve, we can see that the value for the slope that minimizes the Ridge Score is just over 0.4…which corresponds to this blue line.

    Ridge Regression makes me think of being at the top of a cold mountain. Brrr!! And Lasso Regression makes me think of…

6. Next, we increase λ to 10 and calculate the SSR + λ x slope² for a bunch of different slopes, which gives us a new curve for λ = 10 that sits above the blue curve for λ = 0.

[Figure: the Slope Values vs. SSR + λ x slope² graph with the λ = 0 and λ = 10 curves, next to the corresponding lines on the Weight vs. Height data.]
Ridge vs. Lasso Regularization: Details Part 4
7. When we compare the blue line, which has the optimal slope when λ = 0, to the orange line, which has the optimal slope when λ = 10, we see that λ = 10, which increases the Ridge Penalty, results in a smaller slope.

8. Likewise, we can see that when λ = 10, the lowest point on the orange curve corresponds to a slope value closer to 0…than on the blue curve, when λ = 0.

9. Now we calculate the SSR + λ x slope² for a bunch of different slopes with λ = 20…and we get this green curve, whose lowest point corresponds to a slope value even closer to 0.

[Figure: the Slope Values vs. SSR + λ x slope² graph with the λ = 0, λ = 10, and λ = 20 curves.]
Ridge vs. Lasso Regularization: Details Part 5
10. Lastly, we calculate the SSR + λ x slope² for a bunch of different slopes with λ = 40…and we get this purple curve, and again, we see that the lowest point corresponds to a slope that's closer to, but not quite, 0.

11. To summarize what we've seen so far:

1) When we calculate the SSR + λ x slope² for a bunch of different values for the slope, we get a nice curve.

2) When we increase λ, the lowest point in the curve corresponds to a slope value closer to 0, but not quite 0.

BAM!!!

Now let's do the same thing using the SSR + λ x |slope|!!! When λ = 0, the Lasso Penalty is 0, so we're just left with the SSR. And, just like before, we see the lowest point in the blue curve corresponds to a slope value that's just over 0.4.
Ridge vs. Lasso Regularization: Details Part 6
12. Now when we calculate the SSR + λ x |slope| and let λ = 10, we get this orange curve…and when we calculate the SSR + λ x |slope| and let λ = 20, we get this green curve. However, unlike before, these curves have a kink when the slope is 0.

[Figure: the Slope Values vs. SSR + λ x |slope| graph with the λ = 10 and λ = 20 curves, each with a sharp kink at slope = 0.]

BAM!!!
Ridge vs. Lasso Regularization: Details Part 7
16. In summary, when we increase the Ridge, Squared, or L2 Penalty, the optimal value for the slope shifts toward 0, but we retain a nice parabola shape, and the optimal slope is never 0 itself.

17. In contrast, when we increase the Lasso, Absolute Value, or L1 Penalty, the optimal value for the slope shifts toward 0, and since we have a kink at 0, 0 ends up being the optimal slope.

[Figure: the SSR + λ x slope² curves for λ = 0, 10, 20, and 40, all smooth parabolas, next to the SSR + λ x |slope| curves for the same λ values, each with a kink at slope = 0.]

NOTE: Even if we increase λ all the way to 400, the Ridge Penalty gives us a smooth red curve, and the lowest point corresponds to a slope value slightly greater than 0.

TRIPLE BAM!!!

Now let's learn about Decision Trees!
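Here's a minimal Python sketch (not from the book) that scans slope values the way these pages do; the data and the fixed intercept are assumptions, but the output shows the key difference: the Ridge optimum approaches 0 without reaching it, while the Lasso optimum lands exactly on 0 once λ is big enough:

    weights = [0.5, 2.3, 2.9]   # made-up Training Data
    heights = [1.4, 1.9, 3.2]
    intercept = 0.95            # held fixed so the sketch stays 1-dimensional

    def ssr(slope):
        return sum((h - (intercept + slope * w)) ** 2
                   for w, h in zip(weights, heights))

    slopes = [i / 1000 for i in range(-200, 1001)]  # slope values from -0.2 to 1.0
    for lam in (0, 10, 20, 40):
        ridge_best = min(slopes, key=lambda s: ssr(s) + lam * s ** 2)
        lasso_best = min(slopes, key=lambda s: ssr(s) + lam * abs(s))
        print(f'lambda = {lam}: ridge slope = {ridge_best}, lasso slope = {lasso_best}')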
Chapter 10
Decision Trees!!!
Classification and Regression Trees: Main Ideas
1. There are two types of Trees in machine learning: trees for Classification and trees for Regression.

2. Classification Trees classify people or things into two or more discrete categories. For example, this Classification Tree classifies people into two groups, those who love Troll 2 and those who do not:

    Loves Soda?
      Yes → Age < 12.5? …
      No  → Does Not Love Troll 2
Terminology Alert!!! Decision Tree Lingo
1. The good news is that Decision Trees are pretty simple, and there's not too much terminology. One thing that's weird about Decision Trees is that they're upside down! The roots are on top, and the leaves are on the bottom!

Part One: Classification Trees
Classification Trees: Main Ideas

1. The Problem: We have a mixture of discrete and continuous data…

    Loves Popcorn | Loves Soda | Age | Loves Troll 2
    Yes           | Yes        | 7   | No
    Yes           | No         | 12  | No
    No            | Yes        | 18  | Yes
    No            | Yes        | 35  | Yes
    Yes           | Yes        | 38  | Yes
    Yes           | No         | 50  | No
    No            | No         | 83  | No

…that we want to use to predict if someone loves Troll 2, the 1990 blockbuster movie that was neither about trolls nor a sequel. Unfortunately, we can't use Logistic Regression with these data because when we plot Age vs. loves Troll 2, we see that fitting an s-shaped squiggle to the data would be a terrible idea: both young and old people do not love Troll 2, with the people who love Troll 2 in the middle. In this example, an s-shaped squiggle will incorrectly classify all of the older people.

[Figure: Age vs. the probability of loving Troll 2, from 0 = Does Not Love Troll 2 to 1 = Loves Troll 2, with a poorly fitting s-shaped squiggle.]

Building a Classification Tree: Step-by-Step

1. We want to build a Classification Tree that uses Loves Popcorn, Loves Soda, and Age…to predict whether or not someone will love Troll 2.

2. The first thing we do is decide whether Loves Popcorn, Loves Soda, or Age should be the question we ask at the very top of the tree.

3. To make that decision, we'll start by looking at how well Loves Popcorn predicts whether or not someone loves Troll 2…by making a super simple tree that only asks if someone loves Popcorn and running the data down it.

4. For example, the first person in the Training Data loves Popcorn, so they go to the Leaf on the left…and because they do not love Troll 2, we'll put a 1 under the word No:

    Loves Popcorn?
      Yes → Loves Troll 2?  Yes: 0, No: 1
      No  → Loves Troll 2?  Yes: 0, No: 0
Building a Classification Tree: Step-by-Step
5. The second person also loves Popcorn, so they also go to the Leaf on the left, and because they do not love Troll 2, we increase No to 2.

6. The third person does not love Popcorn, so they go to the Leaf on the right, but they love Troll 2, so we put a 1 under the word Yes.

7. After we run the rest of the Training Data down the tree, the counts look like this:

    Loves Popcorn?
      Yes → Loves Troll 2?  Yes: 1, No: 3
      No  → Loves Troll 2?  Yes: 2, No: 1

BAM!!! Then we do the same thing with a super simple tree that only asks if someone loves Soda:

    Loves Soda?
      Yes → Loves Troll 2?  Yes: 3, No: 1
      No  → Loves Troll 2?  Yes: 0, No: 3
Building a Classification Tree: Step-by-Step
9. Now, looking at the two little trees, one for Loves Popcorn and one for Loves Soda, we can see that some of their Leaves contain mixtures of people who love Troll 2 and people who do not.

TERMINOLOGY ALERT!!! Leaves that contain mixtures of classifications are called Impure.
Building a Classification Tree: Step-by-Step
12. To calculate the Gini Impurity for Loves Popcorn, first we calculate the Gini Impurity for each individual Leaf. So, let's start by plugging the numbers from the left Leaf, Yes = 1 and No = 3, into the equation for Gini Impurity.

13. The equation is:

    Gini Impurity for a Leaf = 1 - (the probability of "yes")² - (the probability of "no")²

                             = 1 - ( the number for Yes / the total for the Leaf )² - ( the number for No / the total for the Leaf )²

14. For the Leaf on the left, when we plug in the numbers, Yes = 1, No = 3, and the total for the Leaf = 4, we get:

    Gini Impurity = 1 - ( 1 / 4 )² - ( 3 / 4 )² = 0.375
Building a Classification Tree: Step-by-Step
17. The Total Gini Impurity for a split is the weighted average of the Gini Impurities for its Leaves:

    Total Gini Impurity = weighted average of Gini Impurities for the Leaves

For Age, which is continuous, we sort the ages (7, 12, 18, 35, 38, 50, 83) and test a threshold at the average of each adjacent pair: 9.5, 15, 26.5, 36.5, 44, and 66.5. For example, the candidate threshold Age < 9.5 puts 1 person in the Leaf on the left (Yes: 0, No: 1, Gini Impurity = 0.0) and 6 people in the Leaf on the right (Yes: 3, No: 3, Gini Impurity = 0.5), so its Total Gini Impurity is:

    Total Gini Impurity = ( 1 / ( 1 + 6 ) ) x 0.0 + ( 6 / ( 1 + 6 ) ) x 0.5 = 0.429
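Here's a minimal Python sketch (not from the book) of both calculations, using the counts from the trees above:

    def gini_leaf(yes, no):
        total = yes + no
        if total == 0:
            return 0.0
        return 1 - (yes / total) ** 2 - (no / total) ** 2

    def total_gini(left, right):
        n_left, n_right = sum(left), sum(right)
        n = n_left + n_right
        # the weighted average of the two Leaves' Gini Impurities
        return (n_left / n) * gini_leaf(*left) + (n_right / n) * gini_leaf(*right)

    print(round(total_gini((1, 3), (2, 1)), 3))  # Loves Popcorn: 0.405
    print(round(total_gini((3, 1), (0, 3)), 3))  # Loves Soda: 0.214
    print(round(total_gini((0, 1), (3, 3)), 3))  # Age < 9.5: 0.429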
Building a Classification Tree: Step-by-Step
25. Now remember: our first goal was to determine whether we should ask about Loves Popcorn, Loves Soda, or Age at the very top of the tree…

26. …so we calculated the Gini Impurities for each feature: 0.405 for Loves Popcorn, 0.214 for Loves Soda, and 0.343 for the best Age threshold. Because Loves Soda has the lowest Gini Impurity, we put Loves Soda at the very top of the tree. BAM!!!
Building a Classification Tree: Step-by-Step
28. With Loves Soda at the top of the tree, the 4 people who love Soda, including 3 who love Troll 2 and 1 who does not, go to the Node on the left…

29. …and the 3 people who do not love Soda, all of whom do not love Troll 2, go to the Node on the right, which becomes a Leaf (Yes: 0, No: 3) because it's already pure.

30. To split the Impure Node on the left, we compare the two remaining candidates: Loves Popcorn gives Leaves of (Yes: 1, No: 1) and (Yes: 2, No: 0), while Age < 12.5 gives Leaves of (Yes: 0, No: 1) and (Yes: 3, No: 0). Age < 12.5 gives us two completely pure Leaves, so that's the question we ask next:

    Loves Soda?
      Yes → Age < 12.5?
              Yes → Loves Troll 2?  Yes: 0, No: 1
              No  → Loves Troll 2?  Yes: 3, No: 0
      No  → Loves Troll 2?  Yes: 0, No: 3
Building a Classification Tree: Step-by-Step
38. When we built this tree, only one person in the Training Data made it to the Yes: 0, No: 1 Leaf…and because so few people in the Training Data made it to that Leaf, it's hard to have confidence that the tree will do a great job making predictions with future data.

Part Deux: Regression Trees
Regression Trees: Main Ideas Part 1
1. The Problem: We have this Training Dataset that consists of Drug Effectiveness for different Doses…and fitting a straight line to the data would result in terrible predictions…because there are clusters of ineffective Doses that surround the effective Doses.

[Figure: two panels of Drug Dose (10 to 40) vs. Drug Effectiveness % (25 to 100), the second with a poorly fitting straight line.]

2. A Solution: We can use a Regression Tree, which, just like a Classification Tree, can handle all types of data and all types of relationships among variables to make decisions, but now the output is a continuous value, which, in this case, is Drug Effectiveness.

Just like Classification Trees, Regression Trees are relatively easy to interpret and use. In this example, if you're given a new Dose and want to know how Effective it will be, you start at the top and ask if the Dose is < 14.5…if it is, then the drug will be 4.2% Effective. If the Dose is not < 14.5, then ask if the Dose ≥ 29…if so, then the drug will be 2.5% Effective. If not, ask if the Dose ≥ 23.5…if so, then the drug will be 52.8% Effective. If not, the drug will be 100% Effective. BAM!!!

    Dose < 14.5?
      Yes → 4.2% Effective
      No  → Dose ≥ 29?
              Yes → 2.5% Effective
              No  → Dose ≥ 23.5?
                      Yes → 52.8% Effective
                      No  → 100% Effective
Regression Trees: Main Ideas Part 2

[Figure: the Regression Tree's step-like predictions drawn on top of the Dose vs. Effectiveness data.]

6. For example, when a new Dose falls between 23.5 and 29, the output is the average Effectiveness from these 5 measurements, 52.8%.
Building a Regression Tree: Step-by-Step
1. Given these Training Data, we want to build a Regression Tree that uses Drug Dose to predict Drug Effectiveness.

2. Just like for Classification Trees, the first thing we do for a Regression Tree is decide what goes in the Root.

3. To make that decision, we calculate the average of the first 2 Doses, which is 3 and corresponds to this dotted line…and then we build a very simple tree that splits the measurements into 2 groups based on whether or not the Dose < 3.
Building a Regression Tree: Step-by-Step
5. For the one point with Dose < 3, which has Effectiveness = 0, the Regression Tree makes a pretty good prediction, 0.

6. In contrast, for this specific point, which has Dose > 3 and 100% Effectiveness, the tree predicts that the Effectiveness will be 38.8, which is a pretty bad prediction.

    Dose < 3?
      Yes → Average = 0
      No  → Average = 38.8

7. We can visualize how good or bad the Regression Tree is at making predictions by drawing the Residuals, the differences between the Observed and Predicted values. We can also quantify how good or bad the predictions are by calculating the Sum of the Squared Residuals (SSR). When the threshold for the tree is Dose < 3, then the SSR = 27,468.5:

    (0 - 0)² + (0 - 38.8)² + (0 - 38.8)² + (0 - 38.8)²
    + (5 - 38.8)² + (20 - 38.8)² + (100 - 38.8)²
    + (100 - 38.8)² + … + (0 - 38.8)²
    = 27,468.5

Lastly, we can compare the SSR for different thresholds by plotting them on a graph that has Dose on the x-axis and SSR on the y-axis.
Building a Regression Tree: Step-by-Step

8. Now we shift the Dose threshold to be the average of the second and third measurements in the graph, 5…and we build this super simple tree with Dose < 5 at the Root…and plot its SSR on the graph of Dose thresholds vs. SSR.

14. Then, after shifting the Dose threshold 7 more times, the SSR graph looks like this.

15. And finally, after shifting the Dose threshold all the way to the last pair of Doses…the SSR graph looks like this.

[Figure: the Dose vs. Effectiveness data with the shifting threshold, next to the growing plot of SSR (0 to 30,000) vs. Dose.]
Building a Regression Tree: Step-by-Step

16. Looking at the SSRs for each Dose threshold, we see that Dose < 14.5 had the smallest SSR…so Dose < 14.5 will be the Root of the tree…which corresponds to splitting the measurements into two groups based on whether or not the Dose < 14.5.

17. Now, because the threshold in the Root of the tree is Dose < 14.5, these 6 measurements go into the Node on the left…and, in theory, we could subdivide these 6 measurements into smaller groups by repeating what we just did, only this time focusing only on these 6 measurements.

18. In other words, just like before, we can average the first two Doses and use that value, 3, as a cutoff for splitting the 6 measurements with Dose < 14.5 into 2 groups…then we calculate the SSR for just those 6 measurements and plot it on a graph.
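Here's a minimal Python sketch (not from the book) of the threshold search; the (Dose, Effectiveness) pairs are made up, not the book's exact measurements:

    data = [(3, 0), (5, 0), (8, 0), (12, 5), (14, 20), (16, 100),
            (18, 100), (22, 100), (25, 52), (28, 60), (32, 5), (38, 0)]

    def ssr_for_threshold(threshold):
        left = [e for d, e in data if d < threshold]
        right = [e for d, e in data if d >= threshold]
        total = 0.0
        for group in (left, right):
            if group:
                avg = sum(group) / len(group)  # each side predicts its average
                total += sum((e - avg) ** 2 for e in group)
        return total

    doses = sorted(d for d, _ in data)
    # candidate thresholds: the averages of adjacent pairs of Doses
    candidates = [(a + b) / 2 for a, b in zip(doses, doses[1:])]
    best = min(candidates, key=ssr_for_threshold)
    print(best)  # 15.0 for these made-up data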
Building a Regression Tree: Step-by-Step
19. And after calculating the SSR for each threshold for the 6 measurements with Dose < 14.5, we end up with this graph…and then we select the threshold that gives us the lowest SSR, Dose < 11.5, for the next Node in the tree. BAM?

21. Now, because there are only 6 measurements with Dose < 14.5, there are only 6 measurements in the Node on the left…and because we require a minimum of 7 measurements for further subdivision, the Node on the left will be a Leaf…and the output value for the Leaf is the average Effectiveness from the 6 measurements, 4.2%. BAM!!!

22. Now we need to figure out what to do with the 13 remaining measurements with Doses ≥ 14.5 that go to the Node on the right.

23. Since we have more than 7 measurements in the Node on the right, we can split them into two groups, and we do this by finding the threshold that gives us the lowest SSR.
Building a Regression Tree: Step-by-Step
24. And because there are only 4 measurements with Dose ≥ 29, there are only 4 measurements in this Node…and since the Node has fewer than 7 measurements, we'll make it a Leaf, and the output will be the average Effectiveness from those 4 measurements, 2.5%.

25. Now, because we have more than 7 measurements with Doses between 14.5 and 29, and thus, more than 7 measurements in this Node…we can split the measurements into two groups by finding the Dose threshold that results in the lowest SSR.
Building a Regression Tree: Step-by-Step
26. And since there are fewer than 7 measurements in each of these two groups…this will be the last split, because none of the Leaves has more than 7 measurements in them.

27. So, we use the average Drug Effectiveness for measurements with Doses between 23.5 and 29, 52.8%, as the output for the Leaf on the left…and we use the average Drug Effectiveness for observations with Doses between 14.5 and 23.5, 100%, as the output for the Leaf on the right. That completes the tree, which starts with Dose < 14.5 at the Root, sends Doses < 14.5 to the 4.2% Effective Leaf, and then asks Dose ≥ 29.
Building a Regression Tree With Multiple Features: Part 1
1. So far, we've built a Regression Tree using a single predictor, Dose, to predict Drug Effectiveness.

2. Now let's talk about how to build a Regression Tree to predict Drug Effectiveness using Dose, Age, and Sex.

    Dose | Age  | Sex  | Drug Effect
    10   | 25   | F    | 98
    20   | 73   | M    | 0
    35   | 54   | F    | 6
    5    | 12   | M    | 44
    etc… | etc… | etc… | etc…

NOTE: Just like for Classification Trees, Regression Trees can use any type of variable to make a prediction. However, with Regression Trees, we always try to predict a continuous value.

3. First, we completely ignore Age and Sex and only use Dose to predict Drug Effectiveness…and then we select the threshold that gives us the smallest SSR. However, instead of that threshold instantly becoming the Root, it only becomes a candidate for the Root:

    Dose < 14.5?
      Yes → Average = 4.2
      No  → Average = 51.8
Building a Regression Tree With Multiple Features: Part 2
4. Then, we ignore Dose and Sex and only use Age to predict Effectiveness…and we select the threshold that gives us the smallest SSR…and that becomes the second candidate for the Root:

    Age > 50?
      Yes → Average = 3
      No  → Average = 52

5. Lastly, we ignore Dose and Age and only use Sex to predict Effectiveness…and even though Sex only has one threshold for splitting the data, we still calculate the SSR, just like before…and that becomes the third candidate for the Root:

    Sex = F?
      Yes → Average = 52
      No  → Average = 40
Building a Regression Tree With Multiple Features: Part 3
6. Now we compare the SSRs for the three candidates for the Root. For example, Sex = F has SSR = 20,738. Whichever candidate gives us the lowest SSR becomes the Root, and in this case, that's Age > 50.
Building a Regression Tree With Multiple Features: Part 4
7. Now that Age > 50 is the Root, the people in the Training Data who are older than 50 go to the Node on the left…and the people who are ≤ 50 go to the Node on the right.

8. Then we grow the tree just like before, except now for each split, we have 3 candidates, Dose, Age, and Sex, and we select whichever gives us the lowest SSR…until we can no longer subdivide the data any further. At that point, we're done building the Regression Tree.

[Figure: the finished Regression Tree, whose Leaves output 0%, 5%, 0%, 5%, 75%, and 39% Effectiveness.]

Chapter 11
Support Vector Classifiers and Machines (SVMs)!!!
Support Vector Machines (SVMs): Main Ideas
1. The Problem: We measured how much Popcorn people ate, in grams, and we want to use that information to predict if someone will love Troll 2. The red dots represent people who do not love Troll 2. The blue dots represent people who do. In theory, we could create a y-axis that represented whether or not someone loves Troll 2 and move the dots accordingly…and then use Logistic Regression to fit an s-shaped squiggle to the data…however, this s-shaped squiggle would make a terrible classifier because it would misclassify the people who ate the most Popcorn.

2. A Solution: A Support Vector Machine (SVM) can add a new axis to the data and move the points in a way that makes it relatively easy to draw a straight line that correctly classifies people. In this example, the x-axis is the amount of Popcorn consumed by each person, and the new y-axis is the square of the Popcorn values, Popcorn². Now, all of the points to the left of the straight line are people who do not love Troll 2…

YUM!!! I love popcorn!!!

Support Vector Classifiers: Details Part 1

1. If all the people who do not love Troll 2 didn't eat much Popcorn…and all of the people who love Troll 2 ate a lot of Popcorn…then we could pick this threshold, halfway between the two groups on the Popcorn (g) number line, to classify people.

2. However, as you've probably already guessed, this threshold is terrible because if someone new comes along and they ate this much Popcorn, just barely less than the threshold…

5. One way to make a threshold for classification less sensitive to outliers is to allow misclassifications. For example, if we put the threshold halfway between these two people…

6. NOTE: Allowing the threshold to misclassify someone from the Training Data…

7. …puts the threshold halfway between these two people…
Support Vector Classifiers: Details Part 3

(Legend: blue = Loves Troll 2, red = Does Not Love Troll 2)

8. Now, imagine Cross Validation determined that putting a threshold halfway between these two points gave us the best results…then we would call this threshold a Support Vector Classifier. When the data are 1-Dimensional, just Popcorn (g) on a number line, the Support Vector Classifier, the threshold we use to decide if someone loves or does not love Troll 2, is just a point on the number line.

9. However, if, in addition to measuring how much Popcorn people ate, we also measured how much Soda they drank, then the data would be 2-Dimensional…and a Support Vector Classifier would be a straight line. In other words, the Support Vector Classifier would be 1-Dimensional.

10. Likewise, if we also measured Age, the data would be 3-Dimensional, with Popcorn (g), Soda (ml), and Age axes, and the Support Vector Classifier would be a 2-Dimensional plane. And if we measured 4 things, then the data would be 4-Dimensional, which we can't draw, but the Support Vector Classifier would be 3-Dimensional. Etc., etc., etc.
Support Vector Classifiers: Details Part 4
11. Support Vector Classifiers seem pretty cool because they can handle outliers…but what if, in terms of Popcorn consumption, the people who do not love Troll 2 surrounded the people who did?

12. Now, no matter where we put the Support Vector Classifier, we'll make a lot of misclassifications. Ugh! These are misclassified! Double Ugh!! These and these are misclassified!! Triple Ugh!!! These and these are misclassified!!! Can we do better???
Terminology Alert!!! Oh no! It's the dreaded Terminology Alert!!!

Hey Norm, do you have any good Troll 2 trivia?

2. The distance between the points that define the threshold and the threshold itself is called the Margin…and any points closer to the threshold are called Support Vectors.

[Figure: a series of Popcorn (g) vs. Popcorn² panels illustrating the Margin and the Support Vectors.]

Remember, we started with 1-Dimensional data on a number line, Popcorn Consumed (g). Then, because each person gets x- and y-axis coordinates, Popcorn and Popcorn², the data are now 2-Dimensional…and now that the data are 2-Dimensional, we can draw a Support Vector Classifier that separates the people who love Troll 2 from the people who do not. BAM!!!

I love making popcorn from kernels!!!
Polynomial Kernel: (a x b + r)^d

With r = 1/2 and d = 2, we get:

(a x b + 1/2)² = (a x b + 1/2)(a x b + 1/2)

…and we can multiply both terms together…

= a²b² + (1/2)ab + (1/2)ab + 1/4

…and then combine the middle terms…

= a²b² + ab + 1/4

…and, just because it will make things look better later, let's flip the order of the first two terms…

= ab + a²b² + 1/4

…and, finally, this polynomial is equal to this Dot Product!

ab + a²b² + 1/4 = (a, a², 1/2) . (b, b², 1/2)

NOTE: Dot Products sound fancier than they are. A Dot Product is just the first terms multiplied together, plus the second terms multiplied together, plus the third terms multiplied together:

(a, a², 1/2) . (b, b², 1/2) = (a x b) + (a² x b²) + (1/2 x 1/2) = ab + a²b² + 1/4

Hey Norm, what's a Dot Product?

'Squatch, Dot Products are explained in the NOTE on the right. Bam.
Support Vector Machines (SVMs): Details Part 3

To summarize the last page, we started with the Polynomial Kernel… set r = 1/2 and d = 2… and, after a lot of math, ended up with this Dot Product…

(a x b + r)^d = (a x b + 1/2)² = (a, a², 1/2) . (b, b², 1/2)

…and then StatSquatch learned that Dot Products sound fancier than they are… so now we need to learn why we're so excited about Dot Products!!! BAM!!!
Support Vector Machines (SVMs): Details Part 4

The reason Support Vector Machines use Kernels is that they eliminate the need to actually transform the data from low dimensions to high dimensions. Instead, the Kernels use the Dot Products between every pair of points to compute their high-dimensional relationships and find the optimal Support Vector Classifier.

Umm… How do we find the best values for r and d?

Just try a bunch of values and use Cross Validation to pick the best ones.
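As a sketch of what that search can look like in practice (assuming scikit-learn and made-up Popcorn data; scikit-learn's polynomial kernel calls r coef0 and d degree):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical 1-D Popcorn data (grams) and whether each person loves Troll 2.
popcorn = np.array([[1.0], [2.0], [4.0], [5.0], [6.0], [9.0], [10.0]])
loves_troll2 = np.array([0, 0, 1, 1, 1, 0, 0])

# Try a bunch of values for r (coef0) and d (degree)...
params = {"coef0": [0.0, 0.5, 1.0], "degree": [2, 3]}

# ...and use Cross Validation to pick the best ones.
search = GridSearchCV(SVC(kernel="poly"), params, cv=3)
search.fit(popcorn, loves_troll2)
print(search.best_params_)
```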
Support Vector Machines (SVMs): Details Part 5

For example, if this person ate 5 grams of Popcorn… and this person ate 10 grams of Popcorn… then we can plug their values into the Kernel to compute their high-dimensional relationship without ever actually transforming the data. BAM!!!
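Here's a minimal sketch verifying, for a = 5 and b = 10, that the Polynomial Kernel (with r = 1/2 and d = 2) gives the same number as the Dot Product of the transformed points:

```python
a, b = 5.0, 10.0
r, d = 0.5, 2

# The Kernel computes the high-dimensional relationship directly...
kernel_value = (a * b + r) ** d                  # (5 x 10 + 1/2)^2

# ...and it matches the Dot Product (a, a^2, 1/2) . (b, b^2, 1/2),
# so we never had to actually transform the data.
dot_product = a * b + a**2 * b**2 + 0.5 * 0.5

print(kernel_value, dot_product)                 # both are 2550.25
```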
The Radial Kernel: Intuition

Earlier in this chapter, I mentioned that two of the most popular Kernel Functions for Support Vector Machines are the Polynomial Kernel and the Radial Kernel, which is also called the Radial Basis Function. Since we've already talked about the Polynomial Kernel, let's get an intuition about how the Radial Kernel works.

The basic idea behind the Radial Kernel is that when we want to classify a new person… we simply look to see how the closest points from the Training Data are classified. In this case, the closest points represent people who do not love Troll 2… BAM!!!

The equation for the Radial Kernel might look scary, but it's not that bad. Believe it or not, the Radial Kernel is like a Polynomial Kernel with r = 0 and d = infinity…

(a x b + r)^d = (a x b + 0)^∞

…and that means that the Radial Kernel finds a Support Vector Classifier in infinite dimensions, which sounds crazy, but the math works out. For details, scan, click, or tap this QR code to check out the StatQuest. DOUBLE BAM!!!
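The book points to the StatQuest video for the equation itself; as a sketch, a commonly used form of the Radial Kernel is exp(-gamma x (a - b)²), where gamma is a hyperparameter that, like r and d, is usually picked with Cross Validation:

```python
import numpy as np

# radial_kernel() computes the relationship between two Popcorn values
# without transforming the data.
def radial_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * (a - b) ** 2)

# The closer two points are, the closer the kernel value is to 1, so the
# closest Training Data points have the most say about a new point.
print(radial_kernel(5.0, 5.1))   # close together -> about 0.99
print(radial_kernel(5.0, 10.0))  # far apart -> about 0.0
```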
Hey Norm,
what’s next?
Neural Networks, which sound crazy
scary, but as we’ll soon see, they’re
just Big Fancy Squiggle-Fitting
Machines. So don’t be worried!
Chapter 12
Neural Networks!!!
Neural Networks
Part One:
Neural Networks: Main Ideas

The Problem: We have data that show the Effectiveness of a drug for different Doses… and the s-shaped squiggle that Logistic Regression uses would not make a good fit. In this example, it would misclassify the high Doses.

For example, we could get this Neural Network… to fit a fancy squiggle to the data… or we could get this other Neural Network… to fit a bent shape to the data. BAM!!!
Terminology Alert!!! Anatomy of a Neural Network (Oh no! It's the dreaded Terminology Alert!!!)

[Diagram: a simple Neural Network that converts Dose (Input) into Effectiveness (Output). The top connection multiplies the Dose by -34.4 and adds 2.14; the bottom connection multiplies the Dose by -2.52 and adds 1.29; the two results pass through Activation Functions, are scaled by -1.30 and 2.28, Summed, and then -0.58 is added to produce the Output.]
Terminology Alert!!! Layers (Oh no! A Double Terminology Alert!!!)

Neural Networks are organized in Layers: usually there are multiple Input Nodes that form an Input Layer… and usually there are multiple Output Nodes that form an Output Layer… and the Layers of Nodes between the Input and Output Layers are called Hidden Layers. Part of the art of Neural Networks is deciding how many Hidden Layers to use and how many Nodes should be in each one. Generally speaking, the more Layers and Nodes, the more complicated the shape that can be fit to the data.
Terminology Alert!!! Activation Functions (Oh no! A TRIPLE TERMINOLOGY ALERT!!!)

Although they might sound fancy, Activation Functions are just like the mathematical functions you learned about when you were a kid. For example, the SoftPlus function is:

SoftPlus(x) = log(1 + e^x)

So, if we plug in an x-axis value, x = 2.14… then the SoftPlus will tell us the y-axis coordinate is 2.25, because SoftPlus(2.14) = log(1 + e^2.14) = 2.25.

NOTE: The purpose of this illustration is simply to show you where we're headed in the next section, so rather than think too hard about it, read on!!!
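Here's a minimal sketch of the SoftPlus math from this page (log is the natural log):

```python
import numpy as np

# SoftPlus(x) = log(1 + e^x), using the natural log
def softplus(x):
    return np.log(1 + np.exp(x))

print(round(softplus(2.14), 2))   # 2.25, the y-axis coordinate from the example
```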
A Neural Network in Action: Step-by-Step

A lot of people say that Neural Networks are black boxes and that it's difficult to understand what they're doing. Unfortunately, this is true for big, fancy Neural Networks. However, the good news is that it's not true for simple ones. So, let's walk through how this simple Neural Network works, one step at a time, by plugging in Dose values from low to high and seeing how it converts the Dose (input) into predicted Effectiveness (output).

NOTE: To keep the math in this section simple, let's scale both the x- and y-axes so that they go from 0, for low values, to 1, for high values.

ALSO NOTE: These numbers are parameters that are estimated using a method called Backpropagation, and we'll talk about how that works, step-by-step, later. For now, just assume that these values have already been optimized, much like the slope and intercept are optimized when we fit a line to data.
A Neural Network in Action: Step-by-Step

The first thing we do is plug the lowest Dose, 0, into the Neural Network. Multiplying Dose = 0 by -34.4 and adding 2.14 gives us an x-axis coordinate of 2.14 for the top Node's Activation Function.
A Neural Network in Action: Step-by-Step

Then we plug the x-axis coordinate, 2.14, into the SoftPlus Activation Function… and do the math, just like we saw earlier… and we get 2.25. So, when the x-axis coordinate is 2.14, the corresponding y-axis value is 2.25:

SoftPlus(2.14) = log(1 + e^2.14) = y-axis coordinate = 2.25

Thus, when Dose = 0, the output for the SoftPlus is 2.25. So, let's put a blue dot, corresponding to the blue activation function, onto the graph of the original data at Dose = 0 with a y-axis value = 2.25.

Now let's increase the Dose to 0.2… and calculate the new x-axis coordinate by multiplying the Dose by -34.4 and adding 2.14 to get -4.74…

(Dose x -34.4) + 2.14 = (0.2 x -34.4) + 2.14 = -4.74

…and plug -4.74 into the SoftPlus Activation Function to get the corresponding y-axis value, 0.01:

SoftPlus(-4.74) = log(1 + e^-4.74) = 0.01
A Neural Network in Action: Step-by-Step

Ultimately, the blue dots result in this blue curve… and the next step in the Neural Network tells us to multiply the y-axis coordinates on the blue curve by -1.30. For example, when Dose = 0, the current y-axis coordinate on the blue curve is 2.25… and when we multiply 2.25 by -1.30, we get a new y-axis coordinate, -2.93. Likewise, when we multiply all of the y-axis coordinates on the blue curve by -1.30… we end up flipping and stretching the original blue curve to get a new blue curve.
A Neural Network in Action: Step-by-Step

Okay, we just finished a major step, so this is a good time to review what we've done so far: for each Dose, the top Node's path through the Neural Network produces a y-axis coordinate on the new blue curve. BAM!!!
A Neural Network in Action: Step-by-Step

Now we run Dose values through the connection to the bottom Node in the Hidden Layer: we multiply each Dose by -2.52, add 1.29, run the result through the SoftPlus Activation Function, and then multiply the output by 2.28, which gives us an orange curve.
A Neural Network in Action: Step-by-Step

So, now that we have an orange curve… and a blue curve… the next step in the Neural Network is to add their y-axis coordinates together.
A Neural Network in Action: Step-by-Step

Adding the y-axis coordinates of the blue and orange curves gives us a green squiggle, and now we're ready for the final step, which is to add -0.58 to each y-axis value on the green squiggle. For example, subtracting 0.58 from the y-axis coordinate at Dose = 0 shifts that point down… Likewise, subtracting 0.58 from all of the other y-axis coordinates on the green squiggle shifts it down… and, ultimately, we end up with a green squiggle that fits the Training Data. DOUBLE BAM!!!
A Neural Network in Action: Step-by-Step

Hooray!!! At long last, we see the final green squiggle… that this Neural Network uses to predict Effectiveness from Doses. TRIPLE BAM!!!

Now let's learn how to fit a Neural Network to data!
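To recap the whole walk-through, here's a minimal sketch of the forward pass using the parameter values from the diagram:

```python
import numpy as np

def softplus(x):
    return np.log(1 + np.exp(x))

def predict_effectiveness(dose):
    # Top Node: multiply the Dose by -34.4, add 2.14, apply SoftPlus,
    # and scale the result by -1.30 (the blue curve).
    blue = softplus(dose * -34.4 + 2.14) * -1.30
    # Bottom Node: multiply the Dose by -2.52, add 1.29, apply SoftPlus,
    # and scale the result by 2.28 (the orange curve).
    orange = softplus(dose * -2.52 + 1.29) * 2.28
    # Sum the curves and add the Final Bias, -0.58, to get the green squiggle.
    return blue + orange - 0.58

for dose in [0.0, 0.5, 1.0]:
    print(dose, round(predict_effectiveness(dose), 2))
# Prints roughly -0.01, 1.03, and 0.0: the green squiggle passes close to
# the Training Data (Effectiveness 0, 1, and 0).
```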
Neural Networks
Part Deux:
Fitting a Neural
Network to Data with
Backpropagation
Backpropagation: Main Ideas

The Problem: Just like for Linear Regression, Neural Networks have parameters that we need to optimize to fit a squiggle or bent shape to data. How do we find the optimal values for these parameters?
Terminology Alert!!! Weights and Biases (OH NO! It's the dreaded Terminology Alert!!!)

In Neural Networks, the parameters that we multiply are called Weights… and the parameters that we add are called Biases.
Backpropagation: Details Part 2

Now that the Neural Network has created the green squiggle… we're ready to add the Final Bias to its y-axis coordinates. However, because we don't yet know the optimal value for the Final Bias, we have to give it an initial value. Because Bias terms are frequently initialized to 0, we'll set the Final Bias to 0.0… and adding 0 to all of the y-axis coordinates on the green squiggle leaves it right where it is… which means the green squiggle doesn't fit the Training Data very well.
Backpropagation: Details Part 3

Now, just like we did for R², Linear Regression, and Regression Trees, we can quantify how well the green squiggle fits all of the Training Data by calculating the Sum of the Squared Residuals (SSR):

SSR = Σ (Observedᵢ - Predictedᵢ)², summing from i = 1 to n

For example, for the first Dose, 0, the Observed Effectiveness is 0 and the green squiggle created by the Neural Network predicts 0.57, so we plug in 0 for the Observed value and 0.57 for the Predicted value in the equation for the SSR. Then we add the Residual for when Dose = 0.5. Now the Observed Effectiveness is 1, but the green squiggle predicts 1.61.
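Here's a minimal sketch of that calculation for the three Training Data points (Doses 0, 0.5, and 1), using the third prediction, 0.58 for Dose = 1, that appears later in this chapter:

```python
# Observed Effectiveness and the green squiggle's predictions
# when the Final Bias is still 0.0.
observed = [0.0, 1.0, 0.0]
predicted = [0.57, 1.61, 0.58]

ssr = sum((obs - pred) ** 2 for obs, pred in zip(observed, predicted))
print(round(ssr, 2))  # 1.03
```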
Backpropagation: Details Parts 4 and 5

We can compare different values for the Final Bias by plotting them on a graph that has values for the Final Bias on the x-axis and the corresponding SSR on the y-axis. If we calculate the SSR for different values for the Final Bias, we can see that the lowest SSR occurs when the Final Bias is close to -0.5… however, instead of just plugging in a bunch of numbers at random, we'll use Gradient Descent to quickly find the lowest point in the pink curve, which corresponds to the Final Bias value that gives us the minimum SSR… and that means we need the derivative of the SSR with respect to the Final Bias, d SSR / d Final Bias.
Backpropagation: Details Part 6

Remember, each Predicted value in the SSR comes from the green squiggle:

SSR = Σ (Observedᵢ - Predictedᵢ)²

The first part, the derivative of the SSR with respect to the Predicted values, can be solved using The Chain Rule… by moving the square to the front… and multiplying everything by the derivative of the stuff inside the parentheses, which is -1…

d SSR / d Predicted = d/d Predicted Σ (Observedᵢ - Predictedᵢ)² = Σ 2 x (Observedᵢ - Predictedᵢ) x -1

…and, lastly, we simplify by multiplying 2 by -1:

d SSR / d Predicted = Σ -2 x (Observedᵢ - Predictedᵢ)

NOTE: For more details on how to solve for this derivative, see Chapter 5.

BAM!!! We solved for the first part of the derivative. Now let's solve for the second part.
Backpropagation: Details Part 9

The second part, the derivative of the Predicted values with respect to the Final Bias… is the derivative of the green squiggle with respect to the Final Bias… which, in turn, is the derivative of the sum of the blue and orange curves and the Final Bias:

d/d Final Bias (blue curve + orange curve + Final Bias) = 0 + 0 + 1 = 1
Backpropagation: Details Part 10

Now, to get the derivative of the SSR with respect to the Final Bias, we simply plug in… the derivative of the SSR with respect to the Predicted values… and the derivative of the Predicted values with respect to the Final Bias:

d SSR / d Final Bias = d SSR / d Predicted x d Predicted / d Final Bias = Σ -2 x (Observedᵢ - Predictedᵢ) x 1

In the next section, we'll plug the derivative into Gradient Descent and solve for the optimal value for the Final Bias.
Backpropagation: Step-by-Step

Now that we have the derivative of the SSR with respect to the Final Bias… which tells us how the SSR changes when we change the Final Bias… we can optimize the Final Bias with Gradient Descent:

d SSR / d Final Bias = Σ -2 x (Observedᵢ - Predictedᵢ) x 1

NOTE: We're leaving the "x 1" term in the derivative to remind us that it comes from The Chain Rule. However, multiplying by 1 doesn't do anything, and you can omit it.

First, we expand the sum for the three points in the Training Data:

d SSR / d Final Bias = -2 x (Observed₁ - Predicted₁) x 1 + -2 x (Observed₂ - Predicted₂) x 1 + -2 x (Observed₃ - Predicted₃) x 1
Backpropagation: Step-by-Step

Then we initialize the Final Bias to a random value. In this case, we'll set the Final Bias to 0.0 and plug the Observed values into the expanded derivative:

d SSR / d Final Bias = -2 x (0 - Predicted₁) x 1 + -2 x (1 - Predicted₂) x 1 + -2 x (0 - Predicted₃) x 1
Backpropagation: Step-by-Step

Now we evaluate the derivative at the current value for the Final Bias, which, at this point, is 0.0. When we do the math, we get 3.5… thus, when the Final Bias = 0, the slope of the tangent line is 3.5:

d SSR / d Final Bias = -2 x (0 - 0.57) x 1 + -2 x (1 - 1.61) x 1 + -2 x (0 - 0.58) x 1 = 3.5

Then we calculate the Step Size with the standard formula. NOTE: In this example, we've set the Learning Rate to 0.1.

Step Size = Derivative x Learning Rate = 3.5 x 0.1 = 0.35

Lastly, we calculate a new value for the Final Bias from the current Final Bias:

New Bias = Current Bias - Step Size = 0.0 - 0.35 = -0.35

Gentle Reminder: The magnitude of the derivative is proportional to how big of a step we should take toward the minimum. The sign (+/-) tells us in what direction. Remember, we initialized the Final Bias to 0.0… and the new Final Bias is -0.35, which results in a lower SSR.
Backpropagation: Step-by-Step

With the new value for the Final Bias, -0.35… we shift the green squiggle down a bit, and the Predicted values are closer to the Observed values. BAM!!! We keep repeating these steps until the Step Size is close to 0 and the Final Bias settles near its optimal value.
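Here's a minimal sketch of the whole loop, using the Weights and Biases from this chapter and optimizing only the Final Bias:

```python
import numpy as np

def softplus(x):
    return np.log(1 + np.exp(x))

# The three Training Data points (scaled Doses and observed Effectiveness).
doses = np.array([0.0, 0.5, 1.0])
observed = np.array([0.0, 1.0, 0.0])

def predict(dose, final_bias):
    # Forward pass with every parameter except the Final Bias already optimized.
    blue = softplus(dose * -34.4 + 2.14) * -1.30
    orange = softplus(dose * -2.52 + 1.29) * 2.28
    return blue + orange + final_bias

final_bias = 0.0       # initialize the Final Bias to 0.0...
learning_rate = 0.1    # ...and use the Learning Rate from this example

for step in range(200):
    predicted = predict(doses, final_bias)
    # d SSR / d Final Bias = the sum of -2 x (Observed - Predicted) x 1
    derivative = np.sum(-2 * (observed - predicted))
    step_size = derivative * learning_rate    # first pass: 3.5 x 0.1 = 0.35
    final_bias = final_bias - step_size       # first pass: 0.0 - 0.35 = -0.35

print(round(final_bias, 2))
# About -0.59 with these rounded Weights and Biases -- essentially the
# book's optimized Final Bias, -0.58.
```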
Hooray!!!
We made it to the end of
an exciting book about
machine learning!!!
Congratulations!!!
TRIPLE BAM!!!
Appendices!!!
Appendix A: Pie Probabilities

We're in StatLand, where 70% of the people prefer pumpkin pie and 30% of the people prefer blueberry pie. In other words, 7 out of 10 people in StatLand, or 7/10, prefer pumpkin pie, and the remaining 3 out of 10 people, or 3/10, prefer blueberry pie.

NOTE: In this case, randomly meeting one person who prefers pumpkin or blueberry pie does not affect the next person's pie preference. In probability lingo, we say that these two events, discovering the pie preferences from two randomly selected people, are independent. If, for some strange reason, the second person's preference was influenced by the first, the events would be dependent, and, unfortunately, the math ends up being different from what we show here.

Now, if we just randomly ask someone which type of pie they prefer, 7 out of 10 people, or 7/10ths of the people, will say pumpkin… and of the 7/10ths of the people who say they prefer pumpkin pie… only 7/10ths will be followed by people who also prefer pumpkin pie. Thus, the probability that two people in a row will prefer pumpkin pie is 7/10ths of the original 7/10ths who preferred pumpkin pie, which is 7/10 x 7/10 = 49/100, or 49%:

0.7 x 0.7 = 0.49

BAM!!! The tree diagram shows the same thing: following the 7/10 pumpkin branch twice in a row gives 0.7 x 0.7 = 0.49. DOUBLE BAM!!!
Appendix A: Pie Probabilities

The probability that the first person says they prefer blueberry pie and the next two say they prefer pumpkin is also 0.147:

0.3 x 0.7 x 0.7 = 0.147
Appendix A: Pie Probabilities

Lastly, the probability that the first person prefers pumpkin pie, the second person prefers blueberry, and the third person prefers pumpkin is also 0.147. Following the tree, 0.7 x 0.3 = 0.21 for the first two people, and 0.21 x 0.7 = 0.147 for all three:

0.7 x 0.3 x 0.7 = 0.147

TRIPLE BAM!!!
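Here's a minimal sketch of the same math, plus a sanity check that the probabilities of all possible three-person sequences add up to 1:

```python
from itertools import product

# With independent preferences, the probability of a specific sequence is
# just the product of the individual probabilities.
p = {"pumpkin": 0.7, "blueberry": 0.3}

print(round(p["blueberry"] * p["pumpkin"] * p["pumpkin"], 3))  # 0.147
print(round(p["pumpkin"] * p["blueberry"] * p["pumpkin"], 3))  # 0.147

# All 8 possible three-person sequences, summed, give 1, as they should.
total = sum(p[a] * p[b] * p[c] for a, b, c in product(p, repeat=3))
print(round(total, 10))  # 1.0
```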
Appendix B: The Mean, Variance, and Standard Deviation

Imagine we went to all 5,132 Spend-n-Save food stores and counted the number of green apples that were for sale. We could plot the number of green apples at each store on a number line… but because there's a lot of overlap in the data, we can also draw a Histogram of the measurements. Because we counted the number of green apples at every single store, we have measurements for the entire Population, so we can calculate Population Parameters, like the Population Mean.
Appendix B: The Mean, Variance, and Standard Deviation

In other words, we want to calculate how the data are spread around the Population Mean (which, in this example, is 20): μ = 20. The formula for calculating the Population Variance is…

Population Variance = Σ (x - μ)² / n

…which is a pretty fancy-looking formula, so let's go through it one piece at a time. For example, the first measurement is 2, so we subtract μ, which is 20, from 2…

Population Variance = [ (2 - 20)² + (8 - 20)² + … + (28 - 20)² ] / n
Appendix B: The Mean, Variance, and Standard Deviation

Now that we know how to calculate the Population Variance… when we do the math, we get 100. BAM? Nope, not yet. Because the Variance is based on squared differences, it's in squared units, which are hard to interpret alongside the original data. To solve this problem, we take the square root of the Population Variance to get the Population Standard Deviation… and because the Population Variance is 100, the Population Standard Deviation is 10… and we can plot that on the graph, centered on μ = 20. BAM!!!

In practice, though, we rarely measure an entire Population. If we don't usually calculate Population Parameters, what do we do???
Appendix B: The Mean, Variance, and Standard Deviation

Instead of calculating Population Parameters, we estimate them from a relatively small number of measurements. Estimating the Population Mean is super easy: we just calculate the average of the measurements we collected… and when we do the math, we get 17.6.

NOTE: The Estimated Mean, which is often denoted with the symbol x̄ (x-bar), is also called the Sample Mean… and due to the relatively small number of measurements used to calculate the Estimated Mean, it's different from the Population Mean. A lot of Statistics is dedicated to quantifying and compensating for the differences between Population Parameters, like the Mean and Variance, and their estimated counterparts.

Now that we have an Estimated Mean, we can calculate an Estimated Variance and Standard Deviation. However, we have to compensate for the fact that we only have an Estimated Mean, which will almost certainly be different from the Population Mean. Thus, when we calculate the Estimated Variance and Standard Deviation using the Estimated Mean… we compensate for the difference between the Population Mean and the Estimated Mean by dividing by the number of measurements minus 1, n - 1, rather than n:

Estimated Variance = Σ (x - x̄)² / (n - 1)

Estimated Standard Deviation = √[ Σ (x - x̄)² / (n - 1) ]
Appendix B: The Mean, Variance, and Standard Deviation

Now, when we plug the data into the equation for the Estimated Variance…

Estimated Variance = [ (3 - 17.6)² + (13 - 17.6)² + (19 - 17.6)² + (24 - 17.6)² + (29 - 17.6)² ] / (5 - 1) = 101.8

NOTE: If we had divided by n, instead of n - 1, we would have gotten 81.4, which is a significant underestimate of the true Population Variance, 100.

Lastly, the Estimated Standard Deviation is just the square root of the Estimated Variance… so, in this example, the Estimated Standard Deviation is 10.1. Again, this is relatively close to the Population value we calculated earlier, and the purple Normal Curve drawn with x̄ = 17.6 and Standard Deviation = 10.1 is close to the distribution in green, with Mean = 20 and Standard Deviation = 10. TRIPLE BAM!!!
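Here's a minimal sketch of the Appendix B calculations, using the 5 measurements read off the worked equation above:

```python
import numpy as np

x = np.array([3, 13, 19, 24, 29])

x_bar = x.mean()            # Estimated (Sample) Mean: 17.6
est_var = x.var(ddof=1)     # divide by n - 1: 101.8
est_sd = x.std(ddof=1)      # square root of the Estimated Variance: about 10.1
naive_var = x.var(ddof=0)   # divide by n instead: 81.4, an underestimate

print(x_bar, round(est_var, 1), round(est_sd, 1), round(naive_var, 1))
```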
Appendix C: Computer Commands for Calculating Probabilities with Continuous Probability Distributions

Given this Normal Distribution, with Mean = 155.7 and Standard Deviation = 6.6… let's talk about ways we can get a computer to calculate the area under the curve between 142.5 and 155.7. However, before we get into the specific commands that we can use in Google Sheets, Microsoft Excel, and R, we need to talk about Cumulative Distribution Functions.

For example, given this Normal Distribution with Mean = 155.7 and Standard Deviation = 6.6, a Cumulative Distribution Function for 142.5 would tell us the area under the curve for all x-axis values to the left of, and including, 142.5. In this case, the area is 0.02. Bam.
Appendix C: Computer Commands for Calculating Probabilities with Continuous Probability Distributions

Putting everything together, we get 0.5 - 0.02 = 0.48.

Gentle Reminder about the arguments for the NORMDIST() function: normdist(x-axis value, mean, standard deviation, use CDF)

DOUBLE BAM!!
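If you'd rather do this in Python (the book shows the Google Sheets, Excel, and R commands; scipy here is my own assumption, not something the book uses), here's a minimal sketch:

```python
from scipy.stats import norm

# The area between 142.5 and 155.7 is the CDF at 155.7 (0.5, since 155.7 is
# the Mean) minus the CDF at 142.5 (0.02).
mean, sd = 155.7, 6.6
area = norm.cdf(155.7, mean, sd) - norm.cdf(142.5, mean, sd)
print(round(area, 2))  # 0.48
```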
Appendix D: The Main Ideas of Derivatives

Imagine we collected Test Scores and Time Spent Studying from 3 people… and we fit a straight line to the data. One way to understand the relationship between Test Scores and Time Spent Studying is to look at changes in Test Scores relative to changes in Time Spent Studying. That, in a nutshell, is what a derivative measures: for example, d Fullness / d Time = 3 says that Fullness increases by 3 units for every 1 unit increase in Time.

Terminology Alert!!! A straight line that touches a curve at a single point is called a tangent line. For a curved line, like the Awesomeness curve in the figure, the derivative at any point is the slope of the tangent line at that point (here, the tangent lines have slopes 5 and 3). BAM!!!

In machine learning, 99% of the time we can find the Derivative of a curved line using The Power Rule (See Appendix E) and The Chain Rule (See Appendix F).
Appendix E: The Power Rule

In this example, we have a parabola that represents the relationship between Awesomeness and Likes StatQuest:

Awesomeness = (Likes StatQuest)²

We can calculate the derivative, the change in Awesomeness with respect to the change in Likes StatQuest…

d Awesomeness / d Likes StatQuest = d/d Likes StatQuest (Likes StatQuest)²

The Power Rule tells us to multiply (Likes StatQuest) by the power, 2, and raise (Likes StatQuest) by the original power minus 1:

d/d Likes StatQuest (Likes StatQuest)² = 2 x (Likes StatQuest)^(2-1) = 2 x Likes StatQuest

For example, when Likes StatQuest = 1, the slope of the tangent line is 2 x 1 = 2.
Appendix E: The Power Rule

Here we have a graph of how Happy people are relative to how Tasty the food is… and this is the equation for the squiggle:

Happy = 1 + Tasty³

(On the Tasty Food Index, fresh, hot fries score high, YUM!!!, and cold, greasy fries from yesterday score low. Yuck.) We can calculate the derivative, the change in Happiness with respect to the change in Tastiness…

d Happy / d Tasty = d/d Tasty (Happy)

…by plugging in the equation for Happy…

= d/d Tasty (1 + Tasty³)

…and taking the derivative of each term in the equation:

d/d Tasty (1 + Tasty³) = d/d Tasty 1 + d/d Tasty Tasty³

The constant value, 1, doesn't change, regardless of the value for Tasty, so the derivative with respect to Tasty is 0:

d/d Tasty 1 = 0

The Power Rule tells us to multiply Tasty by the power, which, in this case, is 3… and raise Tasty by the original power, 3, minus 1:

d/d Tasty Tasty³ = 3 x Tasty^(3-1) = 3 x Tasty²

Lastly, we recombine both terms to get the final derivative:

d/d Tasty Happy = 0 + 3 x Tasty² = 3 x Tasty²

Now, when Tasty = -1, the slope of the tangent line is 3 x (-1)² = 3. Bam!
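Here's a minimal sketch checking The Power Rule result against a finite-difference approximation of the slope:

```python
# happy_derivative() is the Power Rule result, d Happy / d Tasty = 3 x Tasty^2;
# the finite-difference slope is a numerical check of that derivative.
def happy(tasty):
    return 1 + tasty ** 3

def happy_derivative(tasty):
    return 3 * tasty ** 2

tasty, h = -1.0, 1e-6
numeric_slope = (happy(tasty + h) - happy(tasty - h)) / (2 * h)
print(happy_derivative(tasty), round(numeric_slope, 4))  # both are 3.0
```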
Appendix F: The Chain Rule!!!

NOTE: This appendix assumes that you're already familiar with the concept of a derivative (Appendix D) and The Power Rule (Appendix E).
Appendix F: The Chain Rule

Imagine we fit a line that relates Weight to Height, and another line that relates Height to Shoe Size. Now, if someone tells us that they weigh this much… then we can use the first line to predict their Height.
Appendix F: The Chain Rule

Now, if someone tells us that they weigh this much… then we can predict that this is their Shoe Size… because Weight and Shoe Size are connected by Height… and if we change the value for Weight, like make it smaller, then the value for Shoe Size changes. In this case, Shoe Size gets smaller, too.

The line that relates Weight to Height rises 2 units of Height for every 1 unit of Weight… and that means the equation for Height is: Height = 2 x Weight, so the derivative is:

d Height / d Weight = 2

The line that relates Height to Shoe Size rises 1/2 a unit of Shoe Size for every 2 units of Height, so the derivative for Shoe Size with respect to Height is:

d Size / d Height = (1/2) / 2 = 1/4

…and that means the equation for Shoe Size is: Shoe Size = 1/4 x Height.
Appendix F: The Chain Rule

Now, because Weight can predict Height… and Height can predict Shoe Size… we can plug the equation for Height into the equation for Shoe Size, and The Chain Rule tells us that the derivative of Shoe Size with respect to Weight is:

d Size / d Weight = d Size / d Height x d Height / d Weight = 1/4 x 2 = 1/2

BAM!!!
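Here's a minimal sketch that chains the two lines together and checks the slope numerically:

```python
# height() and shoe_size() are the two lines from this example; chaining them
# and measuring the slope numerically confirms d Size / d Weight = 1/2.
def height(weight):
    return 2 * weight          # Height = 2 x Weight

def shoe_size(h):
    return 0.25 * h            # Shoe Size = 1/4 x Height

weight, dw = 100.0, 1e-6
slope = (shoe_size(height(weight + dw)) - shoe_size(height(weight))) / dw
print(round(slope, 4))  # 0.5, just like The Chain Rule says
```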
Appendix F: The Chain Rule, A More Complicated Example

Now imagine we measured how Hungry a bunch of people were and how long it had been since they last had a snack. The more Time that had passed since their last snack, the hungrier they got! So, we fit a quadratic line with an intercept of 0.5 to the measurements to reflect the increasing rate of Hunger:

Hunger = Time² + 0.5

Likewise, we fit a square root function to the data that shows how Hunger is related to craving ice cream:

Craves Ice Cream = (Hunger)^(1/2)

If we plug the equation for Hunger into the equation for Craves Ice Cream, we get:

Craves Ice Cream = (Time² + 0.5)^(1/2)

…and raising the sum by 1/2 makes it hard to find the derivative with The Power Rule. Bummer.
Appendix F: The Chain Rule, A More Complicated Example

However, because Hunger links Time Since Last Snack to Craves Ice Cream, we can solve for the derivative using The Chain Rule!!! The Chain Rule tells us that the derivative of Craves Ice Cream with respect to Time… is the derivative of Craves Ice Cream with respect to Hunger… multiplied by the derivative of Hunger with respect to Time:

d Craves / d Time = d Craves / d Hunger x d Hunger / d Time

First, The Power Rule tells us that the derivative of Hunger with respect to Time is:

Hunger = Time² + 0.5, so d Hunger / d Time = 2 x Time

…and the derivative of Craves Ice Cream with respect to Hunger is:

Craves Ice Cream = (Hunger)^(1/2), so d Craves / d Hunger = 1/2 x Hunger^(-1/2) = 1 / (2 x Hunger^(1/2))

Putting the two parts together:

d Craves / d Time = [ 1 / (2 x Hunger^(1/2)) ] x (2 x Time) = (2 x Time) / (2 x Hunger^(1/2)) = Time / Hunger^(1/2)

…and we see that when there's a change in Time Since Last Snack, the change in Craves Ice Cream is equal to Time divided by the square root of Hunger. BAM!!!

However, usually we get both equations jammed together, like this…

Craves Ice Cream = (Time² + 0.5)^(1/2)

…and it's not so obvious how The Chain Rule applies. So, we'll talk about how to deal with this next!!!
Appendix F: The Chain Rule, When The Link Is Not Obvious

In the last part, we said that raising the sum by 1/2 makes it difficult to apply The Power Rule to this equation…

Craves Ice Cream = (Time² + 0.5)^(1/2)

…but there was an obvious way to link Time to Craves with Hunger, so we determined the derivative with The Chain Rule. When the link is not obvious, we can create one: first, let's create a link between Time and Craves Ice Cream called Inside, which is equal to the stuff inside the parentheses…

Inside = Time² + 0.5

…and that means Craves Ice Cream can be rewritten as the square root of the stuff Inside. Now that we've created Inside, the link between Time and Craves, we can apply The Chain Rule to solve for the derivative. The Chain Rule tells us that the derivative of Craves with respect to Time is:

d Craves / d Time = d Craves / d Inside x d Inside / d Time
Acknowledgments

The idea for this book came from comments on my YouTube channel. I'll admit, when I
first saw that people wanted a StatQuest book, I didn't think it would be possible
because I didn’t know how to explain things in writing the way I explained things with
pictures. But once I created the StatQuest study guides, I realized that instead of
writing a book, I could draw a book. And knowing that I could draw a book, I started to
work on this one.
This book would not have been possible without the help of many, many people.
First, I’d like to thank all of the Triple BAM supporters on Patreon and YouTube: U-A
Castle, J. Le, A. Izaki, Gabriel Robet, A. Doss, J. Gaynes, Adila, A. Takeh, J. Butt, M.
Scola, Q95, Aluminum, S. Pancham, A. Cabrera, and N. Thomson.
I’d also like to thank my copy editor, Wendy Spitzer. She worked magic on this book
by correcting a million errors, giving me invaluable feedback on readability, and making
sure each concept was clearly explained. Also, I’d like to thank the technical editors,
Adila, Gabriel Robet, Mahmud Hasan, PhD, Ruizhe Ma, and Samuel Judge who
helped me with everything from the layout to making sure the math was done properly.
Lastly, I’d like to thank Will Falcon and the whole team at Grid.ai.
Index

AUC 158-159
Backpropagation 252-268
Biases 254
Continuous Data 19
Cost Function 88
Data Leakage 23
Dependent Variables 18
Discrete Data 19
Exponential Distribution 54
False Positive 70
Feature 18
Hypothesis Testing 71
Impure 190
Independent Variables 18
Learning Rate 94
Loss Function 88
Margin 224
Models 56-57
Null Hypothesis 71
Overfitting 16
p-values 68-72
Parameter 92
Poisson Distribution 46
Precision 144
Probability Distributions 36
R2 (R-squared) 63-67
Recall 144
Residual 58
Root Node (Decision Tree) 185
Sensitivity 143
SoftPlus Activation Function 239
Sum of Squared Residuals (SSR) 58-60
Support Vector 224
Support Vector Classifier 224
Tangent Line 291
Testing Data 13
True Positive Rate 146
Uniform Distribution 54