
SEN 503/339 Artificial Intelligence

Lec 5.5: Linear and Logistic Regression, Regularization – More Notes

Halûk Gümüşkaya
Professor of Computer Engineering

web: http://www.gumuskaya.com
e-mail: haluk@gumuskaya.com, halukgumuskaya@aydin.edu.tr

LinkedIn: https://tr.linkedin.com/in/halukgumuskaya
Facebook: https://www.facebook.com/2haluk.gumuskaya

Linear and Logistic Regression, Regularization


1. How to Reduce Loss and Gradient Descent
2. Regularization



How do We Reduce Loss?
• Hyperparameters are the configuration settings used to tune
how the model is trained.

• Derivative of (y - y')² with respect to the weights and biases tells
  us how loss changes for a given example.
• Simple to compute and convex.
• So we repeatedly take small steps in the direction that minimizes loss.
• We call these gradient steps (but they're really negative gradient steps).
• This strategy is called Gradient Descent.
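
As a quick sketch of that derivative (assuming the one-feature prediction y' = b + w1 x1 introduced on the next slides, for a single example):

```latex
\begin{aligned}
L(w_1, b) &= (y - y')^2, \qquad y' = b + w_1 x_1 \\
\frac{\partial L}{\partial w_1} &= -2\,x_1\,(y - y'),
\qquad \frac{\partial L}{\partial b} = -2\,(y - y')
\end{aligned}
```

Taking a small step against the sign of each derivative is exactly the negative gradient step described above.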


Block Diagram of Gradient Descent


• The following figure suggests the iterative trial-and-error process that machine learning algorithms use to train a model:

An iterative approach to training a model.


Iterative Approach for Model (Prediction Function)
• We'll use this iterative approach throughout the Machine/Deep Learning course, detailing various complications, particularly within that stormy cloud labeled “Model (Prediction Function).”
• Iterative strategies are prevalent in machine learning, primarily because they scale so well to large data sets.


Linear Regression
 The "model" takes one or more features as input and returns
one prediction ( y’ ) as output.
 To simplify, consider a model that takes one feature and
returns one prediction:

 What initial values should we set for and ?


 For linear regression problems, it turns out that the starting
values aren't important.
 We could pick random values, but we'll just take the following
trivial values instead:



Prediction Function and Compute Loss
• Suppose that the first feature value is 10.
• Plugging that feature value into the prediction function yields:

  y' = b + w1 x1 = 0 + 0 (10) = 0

• The “Compute Loss” part of the diagram is the loss function that the model will use.
• Suppose we use the squared loss function.
• The loss function takes in two input values:
  • y': the model's prediction for the features
  • y: the correct label corresponding to those features
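
A tiny sketch of these two steps in Python (the feature value 10 and the zero starting parameters come from the slides; the label value 7 is made up purely for illustration):

```python
# Prediction function: y' = b + w1 * x1
b, w1 = 0.0, 0.0      # trivial starting values
x1 = 10.0             # first feature value
y = 7.0               # hypothetical correct label (illustrative only)

y_pred = b + w1 * x1        # -> 0.0
loss = (y - y_pred) ** 2    # squared loss takes the prediction y' and the label y -> 49.0
print(y_pred, loss)
```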


Compute Parameter Updates Part


• At last, we've reached the "Compute parameter updates" part of the diagram.
• The machine learning system examines the value of the loss function and generates new values for b and w1.
• For now, just assume that this mysterious box devises new values, and then the machine learning system re-evaluates all those features against all those labels, yielding a new value for the loss function, which yields new parameter values.
• Learning continues iterating until the algorithm discovers the model parameters with the lowest possible loss.
• Iterate until overall loss stops changing or at least changes extremely slowly.
• When that happens, we say that the model has converged.
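
Putting the whole loop together, here is a minimal sketch in Python (the tiny data set, learning rate, and convergence threshold are made-up illustrative values, not part of the course material):

```python
import numpy as np

# Hypothetical one-feature training set (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # underlying rule: y = 2x + 1

b, w1 = 0.0, 0.0                     # trivial starting values
learning_rate = 0.05
prev_loss = float("inf")

for step in range(10_000):
    y_pred = b + w1 * x                       # prediction function
    loss = np.mean((y - y_pred) ** 2)         # compute loss (mean squared error)

    # Compute parameter updates: move against the gradient of the loss.
    grad_w1 = np.mean(-2 * x * (y - y_pred))
    grad_b = np.mean(-2 * (y - y_pred))
    w1 -= learning_rate * grad_w1
    b -= learning_rate * grad_b

    # Converged: overall loss stops changing (or changes extremely slowly).
    if abs(prev_loss - loss) < 1e-9:
        break
    prev_loss = loss

print(step, w1, b, loss)   # expect w1 close to 2 and b close to 1
```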



Reducing Loss: Gradient Descent
• Suppose we calculated the loss for all possible values of w1.
• For the kind of regression problems we've been examining, the resulting plot of loss vs. w1 will always be convex (bowl-shaped).

Regression problems yield convex loss vs. weight plots.



Convex Problems: One Minimum


• Only one minimum; one place where the slope is exactly 0.
• That minimum is where the loss function converges.
• Calculating the loss function for every conceivable value of w1 over the entire data set would be an inefficient way of finding the convergence point.



Convex Problems: Gradient Descent
• A better mechanism, very popular in machine learning, is called gradient descent.
• The first stage: pick a starting value (a starting point) for w1.
• The starting point doesn't matter much; therefore, many algorithms simply set w1 to 0 or pick a random value.
• We've picked a starting point slightly greater than 0:


Weight Initialization
• For convex problems, weights can start anywhere (say, all 0s).
  • Convex: think of a bowl shape.
  • Just one minimum.
• Foreshadowing: not true for neural nets.
  • Non-convex: think of an egg crate.
  • More than one minimum.
  • Strong dependency on initial values.



Gradient Descent Algorithm
• The algorithm then calculates the gradient of the loss curve at the starting point.
• Here in the figure, the gradient of the loss is equal to the derivative (slope) of the curve, and tells you which way is “warmer” or “colder.”
• When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.
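
In symbols, for a model with two weights and a bias, the gradient collects those partial derivatives into a single vector (a general definition, sketched here for reference):

```latex
\nabla L(w_1, w_2, b) =
\left( \frac{\partial L}{\partial w_1},\;
       \frac{\partial L}{\partial w_2},\;
       \frac{\partial L}{\partial b} \right)
```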


To learn more about partial derivatives and gradients:
https://developers.google.com/machine-learning/crash-course/reducing-loss/gradient-descent



Gradient Descent relies on Negative Gradients
• Note that a gradient is a vector, so it has both of the following characteristics:
  • a direction
  • a magnitude
• The gradient always points in the direction of steepest increase in the loss function.
• The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.

A Gradient Step
• To determine the next point along the loss function curve, the gradient descent algorithm adds some fraction of the gradient's magnitude to the starting point, as shown in the following figure:

A gradient step moves us to the next point on the loss curve.


Determining the Next Point: Learning Rate
• Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point.
• For example:
  • If the gradient magnitude is 2.5, and
  • the learning rate is 0.01,
  • then the next point will be 0.025 away from the previous point.
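
The same arithmetic as a one-step sketch in Python (the previous-point value 0.8 is arbitrary; only the gradient magnitude 2.5 and learning rate 0.01 come from the slide):

```python
w_old = 0.8             # arbitrary previous point
gradient = 2.5          # gradient magnitude
learning_rate = 0.01    # a.k.a. step size

step = learning_rate * gradient   # 0.025
w_new = w_old - step              # move against the gradient -> 0.775
print(step, w_new)
```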


Hyperparameters: Learning rate is too small


• Hyperparameters are the knobs that programmers tweak in machine learning algorithms.
• Most machine learning programmers spend a fair amount of time tuning the learning rate.
• If you pick a learning rate that is too small, learning will take too long:

Learning rate is too small.


Learning Rate is too Large
• If the learning rate is too large, the next point will perpetually bounce haphazardly across the bottom of the well, like a quantum mechanics experiment gone horribly wrong:

Learning rate is too large.



Learning rate is just right


• There's a Goldilocks learning rate for every regression problem.
• The Goldilocks value is related to how flat the loss function is.
• If you know the gradient of the loss function is small, then you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.

Learning rate is just right


To Learn More: Ideal Learning Rate




Final Summary: Gradient Steps



Reducing Loss: Optimizing Learning Rate
• Experiment with different learning rates and see how they affect the number of steps required to reach the minimum of the loss curve.
• Try the exercises below the graph.

https://developers.google.com/machine-learning/crash-course/fitter/graph#exercise-1

Learning Rate: 1.00



Learning Rate: 4.00

This time, gradient descent never reaches the minimum. Each step jumps back and forth across the bowl, climbing the curve instead of descending to the bottom. As a result, the steps progressively increase in size.


Epoch, Batch Size and Iterations


• One epoch = one forward pass and one backward pass of all the training examples.
• Batch size = the number of training examples in one forward/backward pass.
  • The higher the batch size, the more memory space you'll need.
• Number of iterations = number of passes, each pass using [batch size] examples.
  • One pass = one forward pass + one backward pass (we do not count the forward pass and the backward pass as two different passes).
• Example: if you have 1000 training examples and your batch size is 500, then it will take 2 iterations to complete 1 epoch (see the sketch below).
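
The same bookkeeping as a short sketch (the epoch count of 10 is an arbitrary illustrative choice):

```python
import math

num_examples = 1000
batch_size = 500

iterations_per_epoch = math.ceil(num_examples / batch_size)   # 2 iterations = 1 epoch
epochs = 10
total_iterations = epochs * iterations_per_epoch               # 20 parameter updates in total
print(iterations_per_epoch, total_iterations)
```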



Reducing Loss: Stochastic Gradient Descent
and Mini-Batch Gradient Descent
• In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration.
• So far, we've assumed that the batch has been the entire data set.
• We could compute the gradient over the entire data set on each step, but this turns out to be unnecessary.
• Computing the gradient on small data samples works well.
  • On every step, get a new random sample.
• Batch Gradient Descent: all training examples at a time.
• Stochastic Gradient Descent: one example at a time.
• Mini-Batch Gradient Descent: batches of 10-1000.
  • Loss and gradients are averaged over the batch (see the sketch below).
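
A minimal sketch contrasting the three variants on the same one-feature linear regression (the data set, step count, and learning rate are made up; only the batch size changes between the three calls):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-feature data set: y is roughly 2x + 1 plus noise.
x = rng.uniform(0, 10, size=1000)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=1000)

def train(batch_size, steps=2000, lr=0.01):
    w, b = 0.0, 0.0
    for _ in range(steps):
        if batch_size >= len(x):
            xb, yb = x, y                     # batch GD: all training examples
        else:
            # SGD / mini-batch GD: a new random sample on every step
            idx = rng.choice(len(x), size=batch_size, replace=False)
            xb, yb = x[idx], y[idx]
        y_pred = w * xb + b
        grad_w = np.mean(-2 * xb * (yb - y_pred))   # loss & gradients averaged over the batch
        grad_b = np.mean(-2 * (yb - y_pred))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

print(train(batch_size=len(x)))   # batch gradient descent
print(train(batch_size=1))        # stochastic gradient descent
print(train(batch_size=100))      # mini-batch gradient descent
```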


Reducing Loss: Check Your Understanding


• Check Your Understanding: Batch Size




Playground: Machine Learning Teaching Tool



Reducing Loss: Playground Exercise
• Learning Rate and Convergence
• This is the first of several Playground exercises.
• Playground is a program developed especially for this course to teach machine learning principles.
• Each Playground exercise generates a dataset.
• The label for this dataset has 2 possible values.
  • You could think of those two possible values as spam vs. not spam, or perhaps healthy trees vs. sick trees.
• The goal of most exercises is to tweak various hyperparameters to build a model that successfully classifies (separates or distinguishes) one label value from the other.
• Note that most data sets contain a certain amount of noise that will make it impossible to successfully classify every example.

https://developers.google.com/machine-learning/crash-course/reducing-loss/playground-exercise

Linear and Logistic Regression, Regularization


1. How to Reduce Loss and Gradient Descent
2. Regularization
3. …..



Underfitting
• Underfitting occurs when a model can't accurately capture the dependencies among data, usually as a consequence of its own simplicity.
• It often yields a low 𝑅² with known data and bad generalization capabilities when applied to new data.


Overfitting
• Overfitting happens when a model learns both the dependencies among data and the random fluctuations.
• In other words, a model learns the existing data too well.
• Complex models, which have many features or terms, are often prone to overfitting.
• When applied to known data, such models usually yield a high 𝑅².
• However, they often don't generalize well and have a significantly lower 𝑅² when used with new data (a short sketch of this gap follows below).
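
One way to see that gap in code, as a rough sketch with made-up data (scikit-learn's PolynomialFeatures, LinearRegression, and r2_score are used only for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Hypothetical data: a gentle quadratic trend plus noise.
x = np.linspace(0, 60, 30).reshape(-1, 1)
y = 0.02 * (x.ravel() - 25) ** 2 + rng.normal(0, 2, size=30)

x_known, y_known = x[::2], y[::2]     # data the model is trained on
x_new, y_new = x[1::2], y[1::2]       # data it has never seen

for degree in (1, 2, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_known, y_known)
    r2_known = r2_score(y_known, model.predict(x_known))
    r2_new = r2_score(y_new, model.predict(x_new))
    print(degree, round(r2_known, 3), round(r2_new, 3))

# Typically the degree-1 line underfits (lowest R²), while the degree-5 model
# scores best on the known data but its advantage shrinks (or reverses) on the new data.
```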



Linear Regression: Underfitting Example

• This linear regression line has a low 𝑅².
• The straight line can't take into account the fact that the actual response increases as 𝑥 moves away from 25 towards zero.
• This is likely an example of underfitting.

Polynomial Regression: Satisfactory Example

• Polynomial regression with the degree equal to 2.
• This might be the optimal degree for modeling this data.
• The model has a value of 𝑅² that is satisfactory in many cases and shows trends nicely.
Polynomial Regression: Signs of Overfitting

• Polynomial regression with the degree equal to 3.
• The value of 𝑅² is higher than in the preceding cases.
• This model behaves better with known data than the previous ones.
• However, it shows some signs of overfitting, especially for input values close to 60, where the line starts decreasing although the actual data don't show that.

Polynomial: Perfect Fit or Overfitted Model?

• Perfect fit: six points and a polynomial line of degree 5 (or higher) yield 𝑅² = 1.
• Each actual response equals its corresponding prediction.
• In some situations, this might be exactly what you're looking for.
• In many cases, however, this is an overfitted model: it is likely to behave poorly with unseen data, especially with inputs larger than 50 (see the sketch below).
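
A rough sketch of this perfect-fit case with NumPy (the six points are made up; any six points with distinct x values give the same 𝑅² = 1 behaviour for a degree-5 fit):

```python
import numpy as np

# Six hypothetical data points with distinct x values.
x = np.array([ 5.0, 15.0, 25.0, 35.0, 45.0, 55.0])
y = np.array([11.0,  8.0,  7.5,  9.0, 13.0, 12.0])

coeffs = np.polyfit(x, y, deg=5)        # a degree-5 polynomial through 6 points
y_pred = np.polyval(coeffs, x)

ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(1 - ss_res / ss_tot)              # ~1.0: every actual response equals its prediction

# Evaluated outside the fitted range, the same polynomial can swing wildly,
# e.g. at x = 65 - the usual symptom of an overfitted model.
print(np.polyval(coeffs, 65.0))
```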
