
StatQuest!!!

Gradient Descent

Study Guide!!!
The Problem: A major part of Machine Learning is optimizing a model's fit to the data. For example, when doing Logistic Regression, we need to find the squiggly line that fits the data the best. Neural Networks optimize the weights associated with each line that connects nodes.

Sometimes, like for Linear Regression, there is an analytical solution, where you plug numbers into an equation and get the best parameters. Bam! But it's not always that easy.

The Solution - Gradient Descent: When there is no analytical solution, Gradient Descent can save the day! Gradient Descent is an iterative procedure that incrementally steps towards an optimal solution, and it can be applied in a very wide variety of situations.

It starts with an initial guess… …then improves the guess, one step at a time, until… …it has found an optimal solution or reaches a maximum number of steps.

NOTES:



The Main Ideas: Gradient Descent is used to optimize parameters. In this example, we want to optimize the y-axis intercept for this line. Later we will show how to optimize 2 or more parameters.

Height = intercept + 0.64 x Weight


NOTE: For now, the slope, 0.64, is the Least Squares estimate.

We use a Loss Function to evaluate candidate parameter values. In this example, the Loss Function is the Sum of the Squared Residuals (SSR): 1.1² + 0.4² + 1.3² = 3.1. A residual is the difference between the observed value and the value predicted by the line.

Since we're optimizing the y-axis intercept, we'll start by setting it to 0, but any value will do: Height = 0 + slope x Weight

NOTE: The average of the SSR, the Mean Squared Error (MSE), is another popular Loss Function.

Different y-axis intercept values result in different Sums of the Squared Residuals (SSR): 1.1² + 0.4² + 1.3² = 3.1. By eye, this looks like the minimum SSR, but another intercept value might be better.

The goal is to find the minimum SSR, but testing every possible value would take forever. Gradient Descent solves this problem by testing relatively few values far from an optimal solution and increasing the number of values tested the closer it gets to the optimal solution.
Residuals: Residuals are the difference between the Observed and Predicted values.

Residual = (Observed Height - Predicted Height) = (Observed Height - (intercept + 0.64 x Weight))

Observed Heights are the values we measured. Predicted Heights come from the equation for the line:

Predicted Height = intercept + 0.64 x Weight

We can plug the equation for the line in for the Predicted value.

A Loss Function: The Sum of Squared Residuals (SSR)

Sum of Squared Residuals (SSR) = (Height - (intercept + 0.64 x Weight))²
                               + (Height - (intercept + 0.64 x Weight))²
                               + (Height - (intercept + 0.64 x Weight))²

There is one term in the sum for each observed point. The equation for the SSR corresponds to the teal line (the SSR curve plotted against the Intercept Value). Plugging in different values for the intercept gives us different sums of squared residuals. The goal is to find the intercept value that results in the minimal SSR, and that corresponds to the lowest point in the curve.
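If you like to see ideas in code, here is a minimal Python sketch (my own, not part of the original guide) of the SSR as a function of the intercept. It assumes the three data points used later in this guide, Weights (0.5, 2.3, 2.9) and Heights (1.4, 1.9, 3.2), and the fixed slope of 0.64.

# A minimal sketch of the SSR Loss Function, assuming the guide's 3 data points.
weights = [0.5, 2.3, 2.9]   # observed Weights
heights = [1.4, 1.9, 3.2]   # observed Heights
slope = 0.64                # fixed at the Least Squares estimate for now

def ssr(intercept):
    """Sum of Squared Residuals for one candidate intercept value."""
    return sum((h - (intercept + slope * w)) ** 2
               for w, h in zip(weights, heights))

# Different intercept values give different SSRs; the goal is the minimum.
for candidate in [0.0, 0.5, 1.0]:
    print(candidate, ssr(candidate))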


Minimizing the SSR: The goal is to step towards a minimum SSR from a random starting point.

SSR = (Height - (intercept + 0.64 x Weight))²
    + (Height - (intercept + 0.64 x Weight))²
    + (Height - (intercept + 0.64 x Weight))²

This equation corresponds to the SSR curve… …and the derivative calculates the slope for any value for the intercept.

A large derivative suggests we are relatively far from the bottom… …a small derivative suggests we are relatively close to the bottom. A negative derivative tells us that the bottom is to the right of the current intercept value. A positive derivative tells us that the bottom is to the left of the current intercept value.

Calculating the Derivative of the SSR: One way to take the derivative of the SSR is to use The Chain Rule.

Step 1: Rewrite the SSR as a function of Inside, which is a function of the intercept.

SSR = (Height - (intercept + 0.64 x Weight))²
SSR = (Inside)²    where Inside = Height - (intercept + 0.64 x Weight)

Step 2: Take the derivatives of SSR and Inside.

d SSR / d Inside = 2 x Inside    d Inside / d intercept = 0 + -1 + 0 = -1

Step 3: Plug the derivatives into The Chain Rule.

d SSR / d intercept = (d SSR / d Inside) x (d Inside / d intercept)
                    = 2 x Inside x -1 = -2 x (Height - (intercept + 0.64 x Weight))

BAM!
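As a rough illustration (my sketch, not the guide's), the chain-rule result above can be typed almost verbatim into Python, again assuming the guide's three data points and the fixed 0.64 slope:

# The derivative of the SSR with respect to the intercept, from The Chain Rule.
weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]
slope = 0.64

def d_ssr_d_intercept(intercept):
    """d SSR / d intercept = sum of -2 x (Height - (intercept + 0.64 x Weight))."""
    return sum(-2 * (h - (intercept + slope * w))
               for w, h in zip(weights, heights))

print(d_ssr_d_intercept(0))  # about -5.7, matching Step 3.1 below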
Gradient Descent for One Parameter, Step-by-Step

Step 1: Plug observed values for Weight and Height into the derivative of the Loss Function.

d SSR / d intercept = -2(Height - (Intercept + 0.64 x Weight))
                    + -2(Height - (Intercept + 0.64 x Weight))
                    + -2(Height - (Intercept + 0.64 x Weight))

d SSR / d intercept = -2(1.4 - (Intercept + 0.64 x 0.5))
                    + -2(1.9 - (Intercept + 0.64 x 2.3))
                    + -2(3.2 - (Intercept + 0.64 x 2.9))

The observed Heights (1.4, 1.9, 3.2) are plugged in for Height, and the observed Weights (0.5, 2.3, 2.9) are plugged in for Weight.

Step 2: Initialize the variable we want to optimize (in this case the Intercept) with a random value. In this example, the initial value for the Intercept is 0.

d SSR / d intercept = -2(1.4 - (0 + 0.64 x 0.5))
                    + -2(1.9 - (0 + 0.64 x 2.3))
                    + -2(3.2 - (0 + 0.64 x 2.9))



Step 3.1: Evaluate the derivative at the current value for the Intercept, 0. When the Intercept = 0, the derivative, or slope, is -5.7.

d SSR / d intercept = -2(1.4 - (0 + 0.64 x 0.5))
                    + -2(1.9 - (0 + 0.64 x 2.3))
                    + -2(3.2 - (0 + 0.64 x 2.9)) = -5.7

NOTE: The magnitude of the slope is proportional to how big of a step we should take towards the minimum. The sign (+/-) tells us what direction.

Step 4.1: Calculate the Step Size. The slope is the derivative evaluated at the current value for the Intercept. The Learning Rate prevents us from taking steps that are too large and is user defined. NOTE: 0.01 is a common default value, but we are using 0.1 in this example.

Step Size = Slope x Learning Rate
          = -5.7 x 0.1
          = -0.57

Step 5.1: Take a step closer to the optimal value for the Intercept.

New Intercept = Old Intercept - Step Size
              = 0 - (-0.57) = 0.57

The Old Intercept is the value used to determine the current slope. In this case, it is 0. The new value for the Intercept, 0.57, moves the line up a little bit.
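In code, Steps 3.1 through 5.1 amount to just a few lines. This is a hedged sketch of my own (the guide itself shows no code), using the same assumed data points and a Learning Rate of 0.1:

weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]
learning_rate = 0.1

intercept = 0.0                                     # Step 2: initial guess
derivative = sum(-2 * (h - (intercept + 0.64 * w))  # Step 3.1: the slope, about -5.7
                 for w, h in zip(weights, heights))
step_size = derivative * learning_rate              # Step 4.1: about -0.57
intercept = intercept - step_size                   # Step 5.1: about 0.57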



Repeat steps 3, 4 and 5, using the new value for the intercept, until the Step Size is close to 0 or you take the maximum number of steps.

Step 3.2: Evaluate the derivative at the current value for the Intercept, 0.57. When the Intercept = 0.57, the derivative, or slope, is -2.3.

d SSR / d intercept = -2(1.4 - (0.57 + 0.64 x 0.5))
                    + -2(1.9 - (0.57 + 0.64 x 2.3))
                    + -2(3.2 - (0.57 + 0.64 x 2.9)) = -2.3

The new slope shows that we have taken a step towards the lowest point in the curve.

Step 4.2: Calculate the Step Size. NOTE: The Step Size is smaller than before because the slope is not as steep as before. This means we are getting closer to the minimum value.

Step Size = Slope x Learning Rate
          = -2.3 x 0.1
          = -0.23

Step 5.2: Take a step closer to the optimal value for the Intercept.

New Intercept = Old Intercept - Step Size
              = 0.57 - (-0.23) = 0.8

The Old Intercept is the value used to determine the current slope. In this case, it is 0.57. The new value for the Intercept, 0.8, moves the line up a little bit more.



Repeat steps 3, 4 and 5, using the new value for the intercept, until the Step Size is close to 0 or you take the maximum number of steps.

Step 3.3: Evaluate the derivative at the current value for the Intercept, 0.8. When the Intercept = 0.8, the derivative, or slope, is -0.9.

d SSR / d intercept = -2(1.4 - (0.8 + 0.64 x 0.5))
                    + -2(1.9 - (0.8 + 0.64 x 2.3))
                    + -2(3.2 - (0.8 + 0.64 x 2.9)) = -0.9

The new slope shows that we have taken a step towards the lowest point in the curve.

Step 4.3: Calculate the Step Size. NOTE: The Step Size is smaller than before because the slope is not as steep as before. This means we are getting closer to the minimum value.

Step Size = Slope x Learning Rate
          = -0.9 x 0.1
          = -0.09

Step 5.3: Take a step closer to the optimal value for the Intercept.

New Intercept = Old Intercept - Step Size
              = 0.8 - (-0.09) = 0.89

The Old Intercept is the value used to determine the current slope. In this case, it is 0.8. The new value for the Intercept, 0.89, moves the line up a little bit more.

Repeat steps 3, 4 and 5, using the new value for the intercept, until the Step Size is close to 0 or you take the maximum number of steps. BAM!
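Putting Steps 3, 4 and 5 into a loop gives the whole procedure. This is a minimal sketch under the same assumptions (the guide's three data points, the fixed 0.64 slope, a 0.1 Learning Rate); the 0.001 stopping threshold and the 1,000-step cap are my own arbitrary choices:

weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]
slope = 0.64
learning_rate = 0.1
max_steps = 1000

intercept = 0.0  # Step 2: initialize with any value
for step in range(max_steps):
    # Step 3: evaluate the derivative at the current intercept
    derivative = sum(-2 * (h - (intercept + slope * w))
                     for w, h in zip(weights, heights))
    # Step 4: calculate the Step Size
    step_size = derivative * learning_rate
    # Step 5: take a step closer to the optimal intercept
    intercept = intercept - step_size
    # Stop when the Step Size is close to 0
    if abs(step_size) < 0.001:
        break

print(intercept)  # settles close to the intercept that minimizes the SSR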
Optimizing 2 or More Parameters

In this example we will optimize the intercept and the slope:

Height = intercept + slope x Weight

The SSR for different values for the Intercept and the Slope forms a 3-D graph: one axis represents different values for the Slope, another axis represents different values for the Intercept, and the third axis is the SSR. Just like before, the goal is to take steps towards the bottom of the graph, where we minimize the Loss Function.

NOTES:

What did the bird say when it stubbed its toe? Owl!!!



Taking partial derivatives of the SSR: For each variable, we take the derivative of the SSR with The Chain Rule.

1) The derivative of the SSR with respect to the intercept:

Step 1: Rewrite the SSR as a function of Inside, which is a function of the intercept.

SSR = (Height - (intercept + slope x Weight))²
SSR = (Inside)²    where Inside = Height - (intercept + slope x Weight)

Step 2: Take the derivatives of SSR and Inside.

d SSR / d Inside = 2 x Inside    d Inside / d intercept = 0 + -1 + 0 = -1

Step 3: Plug the derivatives into The Chain Rule.

d SSR / d intercept = (d SSR / d Inside) x (d Inside / d intercept)
                    = 2 x Inside x -1 = -2 x (Height - (intercept + slope x Weight))

2) The derivative of the SSR with respect to the slope:

Step 1: Rewrite the SSR as a function of Inside, which is a function of the slope.

SSR = (Height - (intercept + slope x Weight))²
SSR = (Inside)²    where Inside = Height - (intercept + slope x Weight)

Step 2: Take the derivatives of SSR and Inside.

d SSR / d Inside = 2 x Inside    d Inside / d slope = 0 - 0 - Weight = -Weight

Step 3: Plug the derivatives into The Chain Rule.

d SSR / d slope = (d SSR / d Inside) x (d Inside / d slope)
                = 2 x Inside x -Weight
                = 2 x (Height - (intercept + slope x Weight)) x -Weight
                = -2 x Weight x (Height - (intercept + slope x Weight))

Step 4: Double Bam!!!
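For illustration only (not from the guide), the two partial derivatives above map directly to Python, assuming the same three data points used throughout:

weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]

def d_ssr_d_intercept(intercept, slope):
    """d SSR / d intercept = sum of -2 x (Height - (intercept + slope x Weight))."""
    return sum(-2 * (h - (intercept + slope * w))
               for w, h in zip(weights, heights))

def d_ssr_d_slope(intercept, slope):
    """d SSR / d slope = sum of -2 x Weight x (Height - (intercept + slope x Weight))."""
    return sum(-2 * w * (h - (intercept + slope * w))
               for w, h in zip(weights, heights))

# Evaluated at Intercept = 0 and Slope = 1, these give about -1.6 and -0.8,
# matching Step 3 of the walkthrough that follows.
print(d_ssr_d_intercept(0, 1), d_ssr_d_slope(0, 1))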


Gradient Descent for 2 or More Parameters, Step-by-Step

Step 1: Plug observed values for Weight and Height into the derivatives of the Loss Function.

d SSR / d intercept = -2(Height - (Intercept + Slope x Weight))
                    + -2(Height - (Intercept + Slope x Weight))
                    + -2(Height - (Intercept + Slope x Weight))

d SSR / d intercept = -2(1.4 - (Intercept + Slope x 0.5))
                    + -2(1.9 - (Intercept + Slope x 2.3))
                    + -2(3.2 - (Intercept + Slope x 2.9))

d SSR / d slope = -2 x Weight(Height - (Intercept + Slope x Weight))
                + -2 x Weight(Height - (Intercept + Slope x Weight))
                + -2 x Weight(Height - (Intercept + Slope x Weight))

d SSR / d slope = -2 x 0.5(1.4 - (Intercept + Slope x 0.5))
                + -2 x 2.3(1.9 - (Intercept + Slope x 2.3))
                + -2 x 2.9(3.2 - (Intercept + Slope x 2.9))

Step 2: Initialize the variables we want to optimize (in this case the Intercept and the Slope) with random values. In this example, the initial value for the Slope is 1 and the initial value for the Intercept is 0.

d SSR / d intercept = -2(1.4 - (0 + 1 x 0.5))
                    + -2(1.9 - (0 + 1 x 2.3))
                    + -2(3.2 - (0 + 1 x 2.9))

d SSR / d slope = -2 x 0.5(1.4 - (0 + 1 x 0.5))
                + -2 x 2.3(1.9 - (0 + 1 x 2.3))
                + -2 x 2.9(3.2 - (0 + 1 x 2.9))



Step 3: Evaluate the derivatives at the current values for the Intercept, 0, and Slope, 1.

d SSR / d intercept = -2(1.4 - (0 + 1 x 0.5))
                    + -2(1.9 - (0 + 1 x 2.3))
                    + -2(3.2 - (0 + 1 x 2.9)) = -1.6

d SSR / d slope = -2 x 0.5(1.4 - (0 + 1 x 0.5))
                + -2 x 2.3(1.9 - (0 + 1 x 2.3))
                + -2 x 2.9(3.2 - (0 + 1 x 2.9)) = -0.8

Step 4: Calculate the Step Sizes.

Step Size for the Intercept = Derivative x Learning Rate
                            = -1.6 x 0.01
                            = -0.016

Step Size for the Slope = Derivative x Learning Rate
                        = -0.8 x 0.01
                        = -0.008

NOTE: We are using a smaller Learning Rate (0.01) than before (0.1) because Gradient Descent can be very sensitive to this parameter. The good news is that, in practice, a good Learning Rate can be determined automatically by starting large and getting smaller with each step.

Step 5: Take a step closer to the optimal values for the Intercept and Slope.

New Intercept = Old Intercept - Step Size for the Intercept
              = 0 - (-0.016) = 0.016

New Slope = Old Slope - Step Size for the Slope
          = 1 - (-0.008) = 1.008

The new values for the Intercept, 0.016, and Slope, 1.008, move the line up and increase the slope a little bit. Repeat steps 3, 4 and 5, using the new values for the intercept and the slope, until the Step Sizes are close to 0 or you take the maximum number of steps. Double BAM!
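Here is a minimal sketch of the full two-parameter loop, under the same assumptions as before (the guide's three data points, a 0.01 Learning Rate); the stopping threshold and the step cap are arbitrary choices of mine:

weights = [0.5, 2.3, 2.9]
heights = [1.4, 1.9, 3.2]
learning_rate = 0.01
max_steps = 10000

intercept, slope = 0.0, 1.0  # Step 2: initial values
for step in range(max_steps):
    # Step 3: evaluate both derivatives (the Gradient) at the current values
    d_intercept = sum(-2 * (h - (intercept + slope * w))
                      for w, h in zip(weights, heights))
    d_slope = sum(-2 * w * (h - (intercept + slope * w))
                  for w, h in zip(weights, heights))
    # Step 4: calculate the Step Sizes
    step_intercept = d_intercept * learning_rate
    step_slope = d_slope * learning_rate
    # Step 5: update both parameters
    intercept = intercept - step_intercept
    slope = slope - step_slope
    # Stop when both Step Sizes are close to 0
    if max(abs(step_intercept), abs(step_slope)) < 0.0001:
        break

print(intercept, slope)  # approaches the Least Squares intercept and slope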
Additional Notes:

Loss Functions: The Sum of the Squared Residuals is just one type of Loss Function. However, there are tons of other Loss Functions that work with other types of data. Regardless of which Loss Function you use, Gradient Descent works the same way.

Stochastic Gradient Descent: When we have lots of data, Gradient Descent can be slow. We can speed things up by using a randomly selected subset of data at each step. When we use a random subset instead of the full dataset, we are doing Stochastic Gradient Descent.
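As a hedged sketch of the idea (mine, not the guide's), the only change for Stochastic Gradient Descent is which points get used at each step; here a random mini-batch stands in for the full dataset, and the batch_size parameter is just something I added for illustration:

import random

def sgd_step(intercept, slope, weights, heights, learning_rate=0.01, batch_size=1):
    """One Stochastic Gradient Descent step using a random subset of the data."""
    batch = random.sample(list(zip(weights, heights)), batch_size)
    d_intercept = sum(-2 * (h - (intercept + slope * w)) for w, h in batch)
    d_slope = sum(-2 * w * (h - (intercept + slope * w)) for w, h in batch)
    return (intercept - learning_rate * d_intercept,
            slope - learning_rate * d_slope)

# Usage: repeat many steps; each step only looks at a random subset of the data.
intercept, slope = 0.0, 1.0
for _ in range(1000):
    intercept, slope = sgd_step(intercept, slope, [0.5, 2.3, 2.9], [1.4, 1.9, 3.2])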

In Summary

Step 1: Take the derivative of the Loss Function for each parameter in it. In fancy Machine Learning Lingo, take the Gradient of the Loss Function.

Step 2: Pick random values for the parameters.

Step 3: Plug the parameter values into the derivatives (ahem, the Gradient).

Step 4: Calculate the Step Sizes: Step Size = Derivative x Learning Rate

Step 5: Calculate the New Parameters: New Parameter = Old Parameter - Step Size

Go back to Step 3 and repeat until the Step Size is very small, or you reach the Maximum Number of Steps.

TRIPLE BAM!!!
