
SEN 503/339 Artificial Intelligence

Lec 5.5: Linear and Logistic Regression, Regularization – More Notes

Halûk Gümüşkaya
Professor of Computer Engineering

web: http://www.gumuskaya.com
e-mail: haluk@gumuskaya.com, halukgumuskaya@aydin.edu.tr

LinkedIn: https://tr.linkedin.com/in/halukgumuskaya
Facebook: https://www.facebook.com/2haluk.gumuskaya

Linear and Logistic Regression, Regularization


1. How to Reduce Loss and Gradient Descent
2. Regularization



How do We Reduce Loss?
• Hyperparameters are the configuration settings used to tune
how the model is trained.

• Derivative of (y - y')² with respect to the weights and biases tells
  us how loss changes for a given example.
• Simple to compute and convex.
• So we repeatedly take small steps in the direction that minimizes loss.
• We call these gradient steps (but they're really negative gradient steps).
• This strategy is called Gradient Descent.
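
As a quick sketch of that derivative (assuming the one-feature prediction y' = b + w1 x1 introduced on the next slides, for a single example):

```latex
\begin{aligned}
L(w_1, b) &= (y - y')^2, \qquad y' = b + w_1 x_1 \\
\frac{\partial L}{\partial w_1} &= -2\,x_1\,(y - y'),
\qquad \frac{\partial L}{\partial b} = -2\,(y - y')
\end{aligned}
```

Taking a small step against the sign of each derivative is exactly the negative gradient step described above.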


Block Diagram of Gradient Descent


• The following figure suggests the iterative trial-and-error process that machine learning algorithms use to train a model:

An iterative approach to training a model.


Iterative Approach for Model (Prediction Function)
• We'll use this iterative approach throughout the Machine/Deep Learning course, detailing various complications, particularly within that stormy cloud labeled “Model (Prediction Function).”
• Iterative strategies are prevalent in machine learning, primarily because they scale so well to large data sets.


Linear Regression
 The "model" takes one or more features as input and returns
one prediction ( y’ ) as output.
 To simplify, consider a model that takes one feature and
returns one prediction:

 What initial values should we set for and ?


 For linear regression problems, it turns out that the starting
values aren't important.
 We could pick random values, but we'll just take the following
trivial values instead:



Prediction Function and Compute Loss
• Suppose that the first feature value is 10.
• Plugging that feature value into the prediction function yields:

  y' = b + w1 x1 = 0 + 0 (10) = 0

• The “Compute Loss” part of the diagram is the loss function that the model will use.
• Suppose we use the squared loss function.
• The loss function takes in two input values:
  • y': the model's prediction for the features
  • y: the correct label corresponding to those features
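
A tiny sketch of these two steps in Python (the feature value 10 and the zero starting parameters come from the slides; the label value 7 is made up purely for illustration):

```python
# Prediction function: y' = b + w1 * x1
b, w1 = 0.0, 0.0      # trivial starting values
x1 = 10.0             # first feature value
y = 7.0               # hypothetical correct label (illustrative only)

y_pred = b + w1 * x1        # -> 0.0
loss = (y - y_pred) ** 2    # squared loss takes the prediction y' and the label y -> 49.0
print(y_pred, loss)
```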


Compute Parameter Updates Part


• At last, we've reached the "Compute parameter updates" part of the diagram.
• The machine learning system examines the value of the loss function and generates new values for b and w1.
• For now, just assume that this mysterious box devises new values, and then the machine learning system re-evaluates all those features against all those labels, yielding a new value for the loss function, which yields new parameter values.
• Learning continues iterating until the algorithm discovers the model parameters with the lowest possible loss.
• Iterate until overall loss stops changing or at least changes extremely slowly.
• When that happens, we say that the model has converged.
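
Putting the whole loop together, here is a minimal sketch in Python (the tiny data set, learning rate, and convergence threshold are made-up illustrative values, not part of the course material):

```python
import numpy as np

# Hypothetical one-feature training set (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # underlying rule: y = 2x + 1

b, w1 = 0.0, 0.0                     # trivial starting values
learning_rate = 0.05
prev_loss = float("inf")

for step in range(10_000):
    y_pred = b + w1 * x                       # prediction function
    loss = np.mean((y - y_pred) ** 2)         # compute loss (mean squared error)

    # Compute parameter updates: move against the gradient of the loss.
    grad_w1 = np.mean(-2 * x * (y - y_pred))
    grad_b = np.mean(-2 * (y - y_pred))
    w1 -= learning_rate * grad_w1
    b -= learning_rate * grad_b

    # Converged: overall loss stops changing (or changes extremely slowly).
    if abs(prev_loss - loss) < 1e-9:
        break
    prev_loss = loss

print(step, w1, b, loss)   # expect w1 close to 2 and b close to 1
```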



Reducing Loss: Gradient Descent
• Suppose we calculated the loss for all possible values of w1.
• For the kind of regression problems we've been examining, the resulting plot of loss vs. w1 will always be convex (bowl-shaped).

Regression problems yield convex loss vs. weight plots.



Convex Problems: One Minimum


• Only one minimum; one place where the slope is exactly 0.
• That minimum is where the loss function converges.
• Calculating the loss function for every conceivable value of w1 over the entire data set would be an inefficient way of finding the convergence point.



Convex Problems: Gradient Descent
• A better mechanism, very popular in machine learning, is called gradient descent.
• The first stage: pick a starting value (a starting point) for w1.
• The starting point doesn't matter much; therefore, many algorithms simply set w1 to 0 or pick a random value.
• We've picked a starting point slightly greater than 0:


Weight Initialization
• For convex problems, weights can start anywhere (say, all 0s).
  • Convex: think of a bowl shape.
  • Just one minimum.
• Foreshadowing: not true for neural nets.
  • Non-convex: think of an egg crate.
  • More than one minimum.
  • Strong dependency on initial values.



Gradient Descent Algorithm
• The algorithm then calculates the gradient of the loss curve at the starting point.
• Here in the figure, the gradient of the loss is equal to the derivative (slope) of the curve, and tells you which way is “warmer” or “colder.”
• When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.
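
In symbols, for a model with two weights and a bias, the gradient collects those partial derivatives into a single vector (a general definition, sketched here for reference):

```latex
\nabla L(w_1, w_2, b) =
\left( \frac{\partial L}{\partial w_1},\;
       \frac{\partial L}{\partial w_2},\;
       \frac{\partial L}{\partial b} \right)
```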


To learn more about partial derivatives and gradients:
https://developers.google.com/machine-learning/crash-course/reducing-loss/gradient-descent



Gradient Descent relies on Negative Gradients
• Note that a gradient is a vector, so it has both of the following characteristics:
  • a direction
  • a magnitude
• The gradient always points in the direction of steepest increase in the loss function.
• The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.

A Gradient Step
• To determine the next point along the loss function curve, the gradient descent algorithm adds some fraction of the gradient's magnitude to the starting point, as shown in the following figure:

A gradient step moves us to the next point on the loss curve.


Determining the Next Point: Learning Rate
• Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point.
• For example:
  • If the gradient magnitude is 2.5, and
  • the learning rate is 0.01,
  • then the next point will be 0.025 away from the previous point.
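
The same arithmetic as a one-step sketch in Python (the previous-point value 0.8 is arbitrary; only the gradient magnitude 2.5 and learning rate 0.01 come from the slide):

```python
w_old = 0.8             # arbitrary previous point
gradient = 2.5          # gradient magnitude
learning_rate = 0.01    # a.k.a. step size

step = learning_rate * gradient   # 0.025
w_new = w_old - step              # move against the gradient -> 0.775
print(step, w_new)
```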


Hyperparameters: Learning rate is too small


• Hyperparameters are the knobs that programmers tweak in machine learning algorithms.
• Most machine learning programmers spend a fair amount of time tuning the learning rate.
• If you pick a learning rate that is too small, learning will take too long:

Learning rate is too small.


Learning Rate is too Large
• If the learning rate is too large, the next point will perpetually bounce haphazardly across the bottom of the well, like a quantum mechanics experiment gone horribly wrong:

Learning rate is too large.



Learning rate is just right


• There's a Goldilocks learning rate for every regression problem.
• The Goldilocks value is related to how flat the loss function is.
• If you know the gradient of the loss function is small, then you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.

Learning rate is just right


To Learn More: Ideal Learning Rate




Final Summary: Gradient Steps



Reducing Loss: Optimizing Learning Rate
• Experiment with different learning rates and see how they affect the number of steps required to reach the minimum of the loss curve.
• Try the exercises below the graph.

https://developers.google.com/machine-learning/crash-course/fitter/graph#exercise-1

Learning Rate: 1.00



Learning Rate: 4.00

This time, gradient descent never reaches the minimum. Each step jumps back and forth across the bowl, climbing the curve instead of descending to the bottom. As a result, the steps progressively increase in size.


Epoch, Batch Size and Iterations


• One epoch = one forward pass and one backward pass of all the training examples.
• Batch size = the number of training examples in one forward/backward pass.
  • The higher the batch size, the more memory space you'll need.
• Number of iterations = number of passes, each pass using [batch size] examples.
  • One pass = one forward pass + one backward pass (we do not count the forward pass and the backward pass as two different passes).
• Example: if you have 1000 training examples and your batch size is 500, then it will take 2 iterations to complete 1 epoch (see the sketch below).
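
The same bookkeeping as a short sketch (the epoch count of 10 is an arbitrary illustrative choice):

```python
import math

num_examples = 1000
batch_size = 500

iterations_per_epoch = math.ceil(num_examples / batch_size)   # 2 iterations = 1 epoch
epochs = 10
total_iterations = epochs * iterations_per_epoch               # 20 parameter updates in total
print(iterations_per_epoch, total_iterations)
```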



Reducing Loss: Stochastic Gradient Descent
and Mini-Batch Gradient Descent
• In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration.
• So far, we've assumed that the batch has been the entire data set.
• We could compute the gradient over the entire data set on each step, but this turns out to be unnecessary.
• Computing the gradient on small data samples works well.
  • On every step, get a new random sample.
• Batch Gradient Descent: all training examples at a time.
• Stochastic Gradient Descent: one example at a time.
• Mini-Batch Gradient Descent: batches of 10-1000.
  • Loss and gradients are averaged over the batch (see the sketch below).
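
A minimal sketch contrasting the three variants on the same one-feature linear regression (the data set, step count, and learning rate are made up; only the batch size changes between the three calls):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-feature data set: y is roughly 2x + 1 plus noise.
x = rng.uniform(0, 10, size=1000)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=1000)

def train(batch_size, steps=2000, lr=0.01):
    w, b = 0.0, 0.0
    for _ in range(steps):
        if batch_size >= len(x):
            xb, yb = x, y                     # batch GD: all training examples
        else:
            # SGD / mini-batch GD: a new random sample on every step
            idx = rng.choice(len(x), size=batch_size, replace=False)
            xb, yb = x[idx], y[idx]
        y_pred = w * xb + b
        grad_w = np.mean(-2 * xb * (yb - y_pred))   # loss & gradients averaged over the batch
        grad_b = np.mean(-2 * (yb - y_pred))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

print(train(batch_size=len(x)))   # batch gradient descent
print(train(batch_size=1))        # stochastic gradient descent
print(train(batch_size=100))      # mini-batch gradient descent
```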


Reducing Loss: Check Your Understanding


• Check Your Understanding: Batch Size




Playground: Machine Learning Teaching Tool



Reducing Loss: Playground Exercise
• Learning Rate and Convergence
• This is the first of several Playground exercises.
• Playground is a program developed especially for this course to teach machine learning principles.
• Each Playground exercise generates a dataset.
• The label for this dataset has 2 possible values.
  • You could think of those two possible values as spam vs. not spam, or perhaps healthy trees vs. sick trees.
• The goal of most exercises is to tweak various hyperparameters to build a model that successfully classifies (separates or distinguishes) one label value from the other.
• Note that most data sets contain a certain amount of noise that will make it impossible to successfully classify every example.

https://developers.google.com/machine-learning/crash-course/reducing-loss/playground-exercise

Linear and Logistic Regression, Regularization


1. How to Reduce Loss and Gradient Descent
2. Regularization
3. …..



Underfitting
• Underfitting occurs when a model can't accurately capture the dependencies among data, usually as a consequence of its own simplicity.
• It often yields a low 𝑅² with known data and bad generalization capabilities when applied to new data.


Overfitting
• Overfitting happens when a model learns both the dependencies among data and the random fluctuations.
• In other words, a model learns the existing data too well.
• Complex models, which have many features or terms, are often prone to overfitting.
• When applied to known data, such models usually yield a high 𝑅².
• However, they often don't generalize well and have a significantly lower 𝑅² when used with new data (a short sketch of this gap follows below).
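
One way to see that gap in code, as a rough sketch with made-up data (scikit-learn's PolynomialFeatures, LinearRegression, and r2_score are used only for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Hypothetical data: a gentle quadratic trend plus noise.
x = np.linspace(0, 60, 30).reshape(-1, 1)
y = 0.02 * (x.ravel() - 25) ** 2 + rng.normal(0, 2, size=30)

x_known, y_known = x[::2], y[::2]     # data the model is trained on
x_new, y_new = x[1::2], y[1::2]       # data it has never seen

for degree in (1, 2, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_known, y_known)
    r2_known = r2_score(y_known, model.predict(x_known))
    r2_new = r2_score(y_new, model.predict(x_new))
    print(degree, round(r2_known, 3), round(r2_new, 3))

# Typically the degree-1 line underfits (lowest R²), while the degree-5 model
# scores best on the known data but its advantage shrinks (or reverses) on the new data.
```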



Linear Regression: Underfitting Example

• This linear regression line has a low 𝑅².
• The straight line can't take into account the fact that the actual response increases as 𝑥 moves away from 25 towards zero.
• This is likely an example of underfitting.

Polynomial Regression: Satisfactory Example

• Polynomial regression with the degree equal to 2.
• This might be the optimal degree for modeling this data.
• The model has a value of 𝑅² that is satisfactory in many cases and shows trends nicely.
Polynomial Regression: Signs of Overfitting

• Polynomial regression with the degree equal to 3.
• The value of 𝑅² is higher than in the preceding cases.
• This model behaves better with known data than the previous ones.
• However, it shows some signs of overfitting, especially for input values close to 60, where the line starts decreasing although the actual data don't show that.

Polynomial: Perfect Fit or Overfitted Model?

• Perfect fit: six points and a polynomial line of degree 5 (or higher) yield 𝑅² = 1.
• Each actual response equals its corresponding prediction.
• In some situations, this might be exactly what you're looking for.
• In many cases, however, this is an overfitted model: it is likely to behave poorly with unseen data, especially with inputs larger than 50 (see the sketch below).
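
A rough sketch of this perfect-fit case with NumPy (the six points are made up; any six points with distinct x values give the same 𝑅² = 1 behaviour for a degree-5 fit):

```python
import numpy as np

# Six hypothetical data points with distinct x values.
x = np.array([ 5.0, 15.0, 25.0, 35.0, 45.0, 55.0])
y = np.array([11.0,  8.0,  7.5,  9.0, 13.0, 12.0])

coeffs = np.polyfit(x, y, deg=5)        # a degree-5 polynomial through 6 points
y_pred = np.polyval(coeffs, x)

ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(1 - ss_res / ss_tot)              # ~1.0: every actual response equals its prediction

# Evaluated outside the fitted range, the same polynomial can swing wildly,
# e.g. at x = 65 - the usual symptom of an overfitted model.
print(np.polyval(coeffs, 65.0))
```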
