
StatQuest!!!

Gradient Descent

Study Guide!!!
The Problem: A major part of Machine Learning is optimizing a model's fit to the data. For example, when doing Logistic Regression, we need to find the squiggly line that fits the data the best. Neural Networks optimize the weights associated with each line that connects nodes.

Sometimes, like for Linear Regression, there is an analytical solution, where you plug numbers into an equation and get the best parameters. Bam! But it's not always that easy.

The Solution - Gradient Descent: When there is no analytical solution, Gradient Descent can save the day! Gradient Descent is an iterative procedure that incrementally steps towards an optimal solution, and it can be applied in a very wide variety of situations.

It starts with an initial guess… …then improves the guess, one step at a time, until… …it has found an optimal solution or reaches a maximum number of steps.

NOTES:



The Main Ideas: Gradient Descent is used to optimize parameters. In this example, we want to optimize the y-axis intercept for this line. Later we will show how to optimize 2 or more parameters.

Height = intercept + 0.64 x Weight


NOTE: For now, the slope, 0.64, is the Least Squares estimate.

We use a Loss Function to evaluate candidate parameter values. In this example, the Loss Function is the Sum of the Squared Residuals (SSR): 1.1² + 0.4² + 1.3² = 3.1. A residual is the difference between the observed value and the value predicted by the line.

Since we're optimizing the y-axis intercept, we'll start by setting it to 0, but any value will do: Height = 0 + slope x Weight

NOTE: The average of the SSR, the Mean Squared Error (MSE), is another popular Loss Function.

Different y-axis intercept values result in different Sums of the Squared Residuals (SSR): 1.1² + 0.4² + 1.3² = 3.1. By eye, this looks like the minimum SSR, but another intercept value might be better.

The goal is to find the minimum SSR, but testing every possible value would take forever. Gradient Descent solves this problem by testing relatively few values far from an optimal solution and increasing the number of values tested the closer it gets to the optimal solution.
Residuals: Residuals are the difference between the Observed and Predicted values.

Residual = (Observed Height - Predicted Height) = (Observed Height - (intercept + 0.64 x Weight))

Observed Heights are the values we measured. Predicted Heights come from the equation for the line:

Predicted Height = intercept + 0.64 x Weight

We can plug the equation for the line in for the Predicted value.

A Loss Function: The Sum of Squared Residuals (SSR)

Sum of Squared Residuals (SSR) = (Height - (intercept + 0.64 x Weight))²
                               + (Height - (intercept + 0.64 x Weight))²
                               + (Height - (intercept + 0.64 x Weight))²

There is one term in the sum for each observed point. The equation for the SSR corresponds to the teal line (the SSR curve plotted against the Intercept Value). Plugging in different values for the intercept gives us different sums of squared residuals. The goal is to find the intercept value that results in the minimal SSR, and that corresponds to the lowest point in the curve.
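If you like to see ideas in code, here is a minimal Python sketch (my own, not part of the original guide) of the SSR as a function of the intercept. It assumes the three data points used later in this guide, Weights (0.5, 2.3, 2.9) and Heights (1.4, 1.9, 3.2), and the fixed slope of 0.64.

# A minimal sketch of the SSR Loss Function, assuming the guide's 3 data points.
weights = [0.5, 2.3, 2.9]   # observed Weights
heights = [1.4, 1.9, 3.2]   # observed Heights
slope = 0.64                # fixed at the Least Squares estimate for now

def ssr(intercept):
    """Sum of Squared Residuals for one candidate intercept value."""
    return sum((h - (intercept + slope * w)) ** 2
               for w, h in zip(weights, heights))

# Different intercept values give different SSRs; the goal is the minimum.
for candidate in [0.0, 0.5, 1.0]:
    print(candidate, ssr(candidate))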


Minimizing the SSR: The goal is to step towards a minimum SSR from a random starting point.

SSR = (Height - (intercept + 0.64 x Weight))²
    + (Height - (intercept + 0.64 x Weight))²
    + (Height - (intercept + 0.64 x Weight))²

This equation corresponds to the SSR curve… …and the derivative calculates the slope for any value for the intercept.

A large derivative suggests we are relatively far from the bottom… …a small derivative suggests we are relatively close to the bottom. A negative derivative tells us that the bottom is to the right of the current intercept value. A positive derivative tells us that the bottom is to the left of the current intercept value.

Calculating the Derivative of the SSR: One way to take the derivative of the SSR is to use The Chain Rule.

Step 1: Rewrite the SSR as a function of Inside, which is a function of the intercept.

SSR = (Height - (intercept + 0.64 x Weight))²
SSR = (Inside)²    where Inside = Height - (intercept + 0.64 x Weight)

Step 2: Take the derivatives of SSR and Inside.

d SSR / d Inside = 2 x Inside    d Inside / d intercept = 0 + -1 + 0 = -1

Step 3: Plug the derivatives into The Chain Rule.

d SSR / d intercept = (d SSR / d Inside) x (d Inside / d intercept)
                    = 2 x Inside x -1 = -2 x (Height - (intercept + 0.64 x Weight))

BAM!
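As a rough illustration (my sketch, not the guide's), the chain-rule result above can be typed almost verbatim into Python, again assuming the guide's three data points and the fixed 0.64 slope:

# The derivative of the SSR with respect to the intercept, from The Chain Rule.
weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]
slope = 0.64

def d_ssr_d_intercept(intercept):
    """d SSR / d intercept = sum of -2 x (Height - (intercept + 0.64 x Weight))."""
    return sum(-2 * (h - (intercept + slope * w))
               for w, h in zip(weights, heights))

print(d_ssr_d_intercept(0))  # about -5.7, matching Step 3.1 below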
Gradient Descent for One Parameter, Step-by-Step

Step 1: Plug observed values for Weight and Height into the derivative of the Loss Function.

d SSR / d intercept = -2(Height - (Intercept + 0.64 x Weight))
                    + -2(Height - (Intercept + 0.64 x Weight))
                    + -2(Height - (Intercept + 0.64 x Weight))

d SSR / d intercept = -2(1.4 - (Intercept + 0.64 x 0.5))
                    + -2(1.9 - (Intercept + 0.64 x 2.3))
                    + -2(3.2 - (Intercept + 0.64 x 2.9))

The observed Heights (1.4, 1.9, 3.2) are plugged in for Height, and the observed Weights (0.5, 2.3, 2.9) are plugged in for Weight.

Step 2: Initialize the variable we want to optimize (in this case the Intercept) with a random value. In this example, the initial value for the Intercept is 0.

d SSR / d intercept = -2(1.4 - (0 + 0.64 x 0.5))
                    + -2(1.9 - (0 + 0.64 x 2.3))
                    + -2(3.2 - (0 + 0.64 x 2.9))



Step 3.1: Evaluate the derivative at the current value for the Intercept, 0. When the Intercept = 0, the derivative, or slope, is -5.7.

d SSR / d intercept = -2(1.4 - (0 + 0.64 x 0.5))
                    + -2(1.9 - (0 + 0.64 x 2.3))
                    + -2(3.2 - (0 + 0.64 x 2.9)) = -5.7

NOTE: The magnitude of the slope is proportional to how big of a step we should take towards the minimum. The sign (+/-) tells us what direction.

Step 4.1: Calculate the Step Size. The slope is the derivative evaluated at the current value for the Intercept. The Learning Rate prevents us from taking steps that are too large and is user defined. NOTE: 0.01 is a common default value, but we are using 0.1 in this example.

Step Size = Slope x Learning Rate
          = -5.7 x 0.1
          = -0.57

Step 5.1: Take a step closer to the optimal value for the Intercept.

New Intercept = Old Intercept - Step Size
              = 0 - (-0.57) = 0.57

The Old Intercept is the value used to determine the current slope. In this case, it is 0. The new value for the Intercept, 0.57, moves the line up a little bit.
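In code, Steps 3.1 through 5.1 amount to just a few lines. This is a hedged sketch of my own (the guide itself shows no code), using the same assumed data points and a Learning Rate of 0.1:

weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]
learning_rate = 0.1

intercept = 0.0                                     # Step 2: initial guess
derivative = sum(-2 * (h - (intercept + 0.64 * w))  # Step 3.1: the slope, about -5.7
                 for w, h in zip(weights, heights))
step_size = derivative * learning_rate              # Step 4.1: about -0.57
intercept = intercept - step_size                   # Step 5.1: about 0.57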



Repeat steps 3, 4 and 5, using the new value for the intercept, until the Step Size is close to 0 or you take the maximum number of steps.

Step 3.2: Evaluate the derivative at the current value for the Intercept, 0.57. When the Intercept = 0.57, the derivative, or slope, is -2.3.

d SSR / d intercept = -2(1.4 - (0.57 + 0.64 x 0.5))
                    + -2(1.9 - (0.57 + 0.64 x 2.3))
                    + -2(3.2 - (0.57 + 0.64 x 2.9)) = -2.3

The new slope shows that we have taken a step towards the lowest point in the curve.

Step 4.2: Calculate the Step Size. NOTE: The Step Size is smaller than before because the slope is not as steep as before. This means we are getting closer to the minimum value.

Step Size = Slope x Learning Rate
          = -2.3 x 0.1
          = -0.23

Step 5.2: Take a step closer to the optimal value for the Intercept.

New Intercept = Old Intercept - Step Size
              = 0.57 - (-0.23) = 0.8

The Old Intercept is the value used to determine the current slope. In this case, it is 0.57. The new value for the Intercept, 0.8, moves the line up a little bit more.



Repeat steps 3, 4 and 5, using the new value for the intercept, until the Step Size is close to 0 or you take the maximum number of steps.

Step 3.3: Evaluate the derivative at the current value for the Intercept, 0.8. When the Intercept = 0.8, the derivative, or slope, is -0.9.

d SSR / d intercept = -2(1.4 - (0.8 + 0.64 x 0.5))
                    + -2(1.9 - (0.8 + 0.64 x 2.3))
                    + -2(3.2 - (0.8 + 0.64 x 2.9)) = -0.9

The new slope shows that we have taken a step towards the lowest point in the curve.

Step 4.3: Calculate the Step Size. NOTE: The Step Size is smaller than before because the slope is not as steep as before. This means we are getting closer to the minimum value.

Step Size = Slope x Learning Rate
          = -0.9 x 0.1
          = -0.09

Step 5.3: Take a step closer to the optimal value for the Intercept.

New Intercept = Old Intercept - Step Size
              = 0.8 - (-0.09) = 0.89

The Old Intercept is the value used to determine the current slope. In this case, it is 0.8. The new value for the Intercept, 0.89, moves the line up a little bit more.

Repeat steps 3, 4 and 5, using the new value for the intercept, until the Step Size is close to 0 or you take the maximum number of steps. BAM!
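Putting Steps 3, 4 and 5 into a loop gives the whole procedure. This is a minimal sketch under the same assumptions (the guide's three data points, the fixed 0.64 slope, a 0.1 Learning Rate); the 0.001 stopping threshold and the 1,000-step cap are my own arbitrary choices:

weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]
slope = 0.64
learning_rate = 0.1
max_steps = 1000

intercept = 0.0  # Step 2: initialize with any value
for step in range(max_steps):
    # Step 3: evaluate the derivative at the current intercept
    derivative = sum(-2 * (h - (intercept + slope * w))
                     for w, h in zip(weights, heights))
    # Step 4: calculate the Step Size
    step_size = derivative * learning_rate
    # Step 5: take a step closer to the optimal intercept
    intercept = intercept - step_size
    # Stop when the Step Size is close to 0
    if abs(step_size) < 0.001:
        break

print(intercept)  # settles close to the intercept that minimizes the SSR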
Optimizing 2 or More Parameters

In this example we will optimize the intercept and the slope:

Height = intercept + slope x Weight

The SSR for different values for the Intercept and the Slope forms a 3-D graph: one axis represents different values for the Slope, another axis represents different values for the Intercept, and the third axis is the SSR. Just like before, the goal is to take steps towards the bottom of the graph, where we minimize the Loss Function.

NOTES:

What did the bird say when it stubbed its toe? Owl!!!



Taking partial derivatives of the SSR: For each variable, we take the derivative of the SSR with The Chain Rule.

1) The derivative of the SSR with respect to the intercept:

Step 1: Rewrite the SSR as a function of Inside, which is a function of the intercept.

SSR = (Height - (intercept + slope x Weight))²
SSR = (Inside)²    where Inside = Height - (intercept + slope x Weight)

Step 2: Take the derivatives of SSR and Inside.

d SSR / d Inside = 2 x Inside    d Inside / d intercept = 0 + -1 + 0 = -1

Step 3: Plug the derivatives into The Chain Rule.

d SSR / d intercept = (d SSR / d Inside) x (d Inside / d intercept)
                    = 2 x Inside x -1 = -2 x (Height - (intercept + slope x Weight))

2) The derivative of the SSR with respect to the slope:

Step 1: Rewrite the SSR as a function of Inside, which is a function of the slope.

SSR = (Height - (intercept + slope x Weight))²
SSR = (Inside)²    where Inside = Height - (intercept + slope x Weight)

Step 2: Take the derivatives of SSR and Inside.

d SSR / d Inside = 2 x Inside    d Inside / d slope = 0 - 0 - Weight = -Weight

Step 3: Plug the derivatives into The Chain Rule.

d SSR / d slope = (d SSR / d Inside) x (d Inside / d slope)
                = 2 x Inside x -Weight
                = 2 x (Height - (intercept + slope x Weight)) x -Weight
                = -2 x Weight x (Height - (intercept + slope x Weight))

Step 4: Double Bam!!!
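For illustration only (not from the guide), the two partial derivatives above map directly to Python, assuming the same three data points used throughout:

weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]

def d_ssr_d_intercept(intercept, slope):
    """d SSR / d intercept = sum of -2 x (Height - (intercept + slope x Weight))."""
    return sum(-2 * (h - (intercept + slope * w))
               for w, h in zip(weights, heights))

def d_ssr_d_slope(intercept, slope):
    """d SSR / d slope = sum of -2 x Weight x (Height - (intercept + slope x Weight))."""
    return sum(-2 * w * (h - (intercept + slope * w))
               for w, h in zip(weights, heights))

# Evaluated at Intercept = 0 and Slope = 1, these give about -1.6 and -0.8,
# matching Step 3 of the walkthrough that follows.
print(d_ssr_d_intercept(0, 1), d_ssr_d_slope(0, 1))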


Gradient Descent for 2 or More Parameters, Step-by-Step

Step 1: Plug observed values for Weight and Height into the derivatives of the Loss Function.

d SSR / d intercept = -2(Height - (Intercept + Slope x Weight))
                    + -2(Height - (Intercept + Slope x Weight))
                    + -2(Height - (Intercept + Slope x Weight))

d SSR / d intercept = -2(1.4 - (Intercept + Slope x 0.5))
                    + -2(1.9 - (Intercept + Slope x 2.3))
                    + -2(3.2 - (Intercept + Slope x 2.9))

d SSR / d slope = -2 x Weight(Height - (Intercept + Slope x Weight))
                + -2 x Weight(Height - (Intercept + Slope x Weight))
                + -2 x Weight(Height - (Intercept + Slope x Weight))

d SSR / d slope = -2 x 0.5(1.4 - (Intercept + Slope x 0.5))
                + -2 x 2.3(1.9 - (Intercept + Slope x 2.3))
                + -2 x 2.9(3.2 - (Intercept + Slope x 2.9))

Step 2: Initialize the variables we want to optimize (in this case the Intercept and the Slope) with random values. In this example, the initial value for the Slope is 1 and the initial value for the Intercept is 0.

d SSR / d intercept = -2(1.4 - (0 + 1 x 0.5))
                    + -2(1.9 - (0 + 1 x 2.3))
                    + -2(3.2 - (0 + 1 x 2.9))

d SSR / d slope = -2 x 0.5(1.4 - (0 + 1 x 0.5))
                + -2 x 2.3(1.9 - (0 + 1 x 2.3))
                + -2 x 2.9(3.2 - (0 + 1 x 2.9))



Step 3: Evaluate the derivatives at the current values for the Intercept, 0, and Slope, 1.

d SSR / d intercept = -2(1.4 - (0 + 1 x 0.5))
                    + -2(1.9 - (0 + 1 x 2.3))
                    + -2(3.2 - (0 + 1 x 2.9)) = -1.6

d SSR / d slope = -2 x 0.5(1.4 - (0 + 1 x 0.5))
                + -2 x 2.3(1.9 - (0 + 1 x 2.3))
                + -2 x 2.9(3.2 - (0 + 1 x 2.9)) = -0.8

Step 4: Calculate the Step Sizes.

Step Size for the Intercept = Derivative x Learning Rate
                            = -1.6 x 0.01
                            = -0.016

Step Size for the Slope = Derivative x Learning Rate
                        = -0.8 x 0.01
                        = -0.008

NOTE: We are using a smaller Learning Rate (0.01) than before (0.1) because Gradient Descent can be very sensitive to this parameter. The good news is that, in practice, a good Learning Rate can be determined automatically by starting large and getting smaller with each step.

Step 5: Take a step closer to the optimal values for the Intercept and Slope.

New Intercept = Old Intercept - Step Size for the Intercept
              = 0 - (-0.016) = 0.016

New Slope = Old Slope - Step Size for the Slope
          = 1 - (-0.008) = 1.008

The new values for the Intercept, 0.016, and Slope, 1.008, move the line up and increase the slope a little bit. Repeat steps 3, 4 and 5, using the new values for the intercept and the slope, until the Step Sizes are close to 0 or you take the maximum number of steps. Double BAM!
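Here is a minimal sketch of the full two-parameter loop, under the same assumptions as before (the guide's three data points, a 0.01 Learning Rate); the stopping threshold and the step cap are arbitrary choices of mine:

weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]
learning_rate = 0.01
max_steps = 10000

intercept, slope = 0.0, 1.0  # Step 2: initial values
for step in range(max_steps):
    # Step 3: evaluate both derivatives (the Gradient) at the current values
    d_intercept = sum(-2 * (h - (intercept + slope * w))
                      for w, h in zip(weights, heights))
    d_slope = sum(-2 * w * (h - (intercept + slope * w))
                  for w, h in zip(weights, heights))
    # Step 4: calculate the Step Sizes
    step_intercept = d_intercept * learning_rate
    step_slope = d_slope * learning_rate
    # Step 5: update both parameters
    intercept = intercept - step_intercept
    slope = slope - step_slope
    # Stop when both Step Sizes are close to 0
    if max(abs(step_intercept), abs(step_slope)) < 0.0001:
        break

print(intercept, slope)  # approaches the Least Squares intercept and slope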
Additional Notes:

Loss Functions: The Sum of the Squared Residuals is just one type of Loss Function. However, there are tons of other Loss Functions that work with other types of data. Regardless of which Loss Function you use, Gradient Descent works the same way.

Stochastic Gradient Descent: When we have lots of data, Gradient Descent can be slow. We can speed things up by using a randomly selected subset of data at each step. When we use a random subset instead of the full dataset, we are doing Stochastic Gradient Descent.
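As a hedged sketch of the idea (mine, not the guide's), the only change for Stochastic Gradient Descent is which points get used at each step; here a random mini-batch stands in for the full dataset, and the batch_size parameter is just something I added for illustration:

import random

def sgd_step(intercept, slope, weights, heights, learning_rate=0.01, batch_size=1):
    """One Stochastic Gradient Descent step using a random subset of the data."""
    batch = random.sample(list(zip(weights, heights)), batch_size)
    d_intercept = sum(-2 * (h - (intercept + slope * w)) for w, h in batch)
    d_slope = sum(-2 * w * (h - (intercept + slope * w)) for w, h in batch)
    return (intercept - learning_rate * d_intercept,
            slope - learning_rate * d_slope)

# Usage: repeat many steps; each step only looks at a random subset of the data.
intercept, slope = 0.0, 1.0
for _ in range(1000):
    intercept, slope = sgd_step(intercept, slope, [0.5, 2.3, 2.9], [1.4, 1.9, 3.2])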

In Summary

Step 1: Take the derivative of the Loss Function for each parameter in it. In fancy Machine Learning Lingo, take the Gradient of the Loss Function.

Step 2: Pick random values for the parameters.

Step 3: Plug the parameter values into the derivatives (ahem, the Gradient).

Step 4: Calculate the Step Sizes: Step Size = Derivative x Learning Rate

Step 5: Calculate the New Parameters: New Parameter = Old Parameter - Step Size

Go back to Step 3 and repeat until the Step Size is very small, or you reach the Maximum Number of Steps.

TRIPLE BAM!!!
