
Principles and Application of Data Science

Loss Functions & Optimization


Lecture 22

SNU Graduate School of Data Science


Review

Linear Classifier

Parametric Approach
Instead of memorizing all training examples,
think of a function f that maps the input (image x) to the label scores (class y):
y = f(x)

Parametric Approach: Linear Classifier
Q: What should be the form of this function f?
A: There can be many ways, but let's start with a simple one: a linear function!
→ a weighted sum of the input pixels of image x, with a weight (parameter) in W corresponding to each pixel

Parametric Approach: Linear Classifier
Q: What should be the form of this function f?
A: There can be many ways, but let's start with a simple one: a linear function!
→ weighted sum of input pixels: W1,1x1,1 + W1,2x1,2 + … + WM,NxM,N

This weighted sum of the image x with the weights (parameters) W is the score for class airplane (label y). We make 10 different W, one for each class, in the same way.
Parametric Approach: Linear Classifier
We need to determine the values in W such that the correct class gets the highest score for each image.
For this example (image x, label y), class 'cat' should have the highest score among the outputs of f(x,W) = Wx, where W are the weights or parameters.

Parametric Approach: Linear Classifier
Let's check the dimensionality of each variable:
• Image x: 32 × 32 × 3 (= 3072 numbers)
• Label scores y = f(x,W): 10 numbers, coming from 10 independent classifiers f (one for each class)
Parametric Approach: Linear Classifier
Let's check the dimensionality of each variable, for f(x,W) = Wx:
• Image x: 3072 × 1 (32 × 32 × 3 = 3072 numbers)
• Weights or parameters W: 10 × 3072 (10 independent classifiers f, one for each class)
• Label scores y = f(x,W): 10 × 1
Parametric Approach: Linear Classifier
Let's check the dimensionality of each variable, for f(x,W) = Wx + b:
• Image x: 3072 × 1 (32 × 32 × 3 = 3072 numbers)
• Weights or parameters W: 10 × 3072 (10 independent classifiers f, one for each class)
• Bias b: 10 × 1, affecting the output without interacting with the data x
• Label scores y = f(x,W): 10 × 1

Parametric Approach: Linear Classifier
A separate variable b makes things complicated...

f(x,W) = Wx + b = [W b] [x; 1]
(10 × 1)   (10 × 3072)(3072 × 1) + (10 × 1)   (10 × 3073)(3073 × 1)

Let's call [W b] = W' and [x; 1] = x'.
Thus f(x,W) = W'x', and we may omit the primes (').
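As a small illustration of this bias trick, here is a minimal NumPy sketch; only the shapes follow the slide, and the random values are placeholders:

import numpy as np

x = np.random.rand(3072)                 # flattened 32x32x3 image
W = np.random.randn(10, 3072)            # one row of weights per class
b = np.random.randn(10)                  # one bias per class

scores = W @ x + b                       # original form: Wx + b, shape (10,)

W_prime = np.hstack([W, b[:, None]])     # append b as an extra column of W: shape (10, 3073)
x_prime = np.append(x, 1.0)              # append a constant 1 to x: shape (3073,)
scores_prime = W_prime @ x_prime         # a single matrix-vector product

assert np.allclose(scores, scores_prime) # identical scores, with b folded into W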
Parametric Approach: Linear Classifier
So, this is our final linear classifier model:

f(x,W) = Wx
Advantages of parametric models:
• Once trained, what we need is the weights W only. We do not need to store
the huge training dataset. → space efficient
• At test time, an example can be evaluated by a single matrix-vector
multiplication (Wx). → much faster than comparing against all training data.

Illustration
Example with an image with 4 pixels and 3 classes (cat/dog/ship)
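The slide's figure is not reproduced here; the sketch below redoes the same kind of computation in NumPy. The pixel values and weights are illustrative choices (the dog score happens to come out to 437.9, the value discussed on a later slide), not necessarily the exact numbers in the figure:

import numpy as np

x = np.array([56.0, 231.0, 24.0, 2.0])        # 4 pixel values, flattened into a vector

W = np.array([[ 0.2, -0.5,  0.1,  2.0 ],      # weights for class 'cat'
              [ 1.5,  1.3,  2.1,  0.0 ],      # weights for class 'dog'
              [ 0.0,  0.25, 0.2, -0.3 ]])     # weights for class 'ship'
b = np.array([1.1, 3.2, -1.2])                # one bias per class

scores = W @ x + b
print(scores)         # [-96.8, 437.9, 60.75] -> 'dog' gets the highest score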

Interpreting a Linear Classifier: Geometric Viewpoint
• Each linear decision boundary comes from the corresponding row of W.
• If the values in W change, the corresponding decision boundary will rotate.
• With a change in b, the corresponding decision boundary will shift up/down.

Interpreting a Linear Classifier: Visual Viewpoint

Interpreting a Linear Classifier: Visual Viewpoint
What the linear classifier does is
● At training: learning a template for each class from the training data
● At testing: performing template matching with a new example

It is similar to kNN in that both compare distances, but different in that the linear classifier compares only against K class templates, while kNN compares against all N training examples (usually, K << N).

Limitations of Linear Classifier
What is the meaning of the score?
• It is unbounded: it can be arbitrarily large.
• It is hard to interpret: what is the meaning of 437.9? How good is it?
• It would be nicer to get a bounded score between 0 and 1, so that we can interpret it as a probability.
Limitations of Linear Classifier
What is the meaning of the score?
• Let's define how confident we are that this image belongs to each class.
• For 2 classes: if s1 > s2, we believe this image is in class 1 more than in class 2.
• The larger the gap (s1 - s2) is, the more confident we are that x ∈ c1.
• Similarly, the other way around also holds: the larger s2 - s1 is, the more likely x ∈ c2.

Sigmoid Function
Given the score difference s (= s1 - s2), we want a function that maps
• larger s closer to 1,
• smaller (large negative) s closer to 0,
• 0 to 0.5.
We may use the sigmoid function σ(s) = 1 / (1 + e^(-s)) as a bounded score for a class!
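A quick numeric check of this mapping (a minimal sketch, not part of the original slide):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))   # maps any real score difference into (0, 1)

for s in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"s = {s:6.1f} -> sigma(s) = {sigmoid(s):.4f}")
# large negative s -> close to 0; s = 0 -> 0.5; large positive s -> close to 1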

Softmax Classifier
Let's check if this is really a probability:
1) 0 ≤ p(y=cn|x) ≤ 1 (for n = 1, 2)
2) p(y=c1|x) + p(y=c2|x) = 1

Generalizing this to K > 2 classes:

p(y=ck|x) = exp(sk) / Σj exp(sj)    (the Softmax Function)
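A small sketch of the softmax function in code (the max-subtraction step is a standard numerical-stability trick added here, not something stated on the slide):

import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # subtract the max so exp() does not overflow
    e = np.exp(shifted)
    return e / e.sum()                  # probabilities: nonnegative and summing to 1

scores = np.array([-96.8, 437.9, 60.75])   # e.g., the cat/dog/ship scores from before
print(softmax(scores))                     # 'dog' gets essentially all the probability mass
print(softmax(scores).sum())               # 1.0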

Softmax Classifier
Coming back to this example:
• From the estimated probabilities, our softmax classifier chooses class 2 (dog) with almost 100% confidence.
• The estimated probabilities for the three classes are roughly 7e-260, ≈ 1, and 5e-122.

How to find the Weights?
We have not found an answer for the biggest question yet:
How to set the values in W (the parameters)?
• Machine learning is a data-driven approach.
• We (humans) just design the form of our model (e.g., f(x,W) = Wx), and
• initialize the parameter values (W) randomly.
• Then, feed a training example x to estimate the label ŷ.
• Compare our estimate ŷ to the ground truth label y, estimating how good/bad we currently are. → Loss Functions
• Update the parameters (W) based on this loss. → Optimization
• Repeat this until we get ŷ ≈ y.

Loss Functions

Loss Function
A loss function quantifies how good (or bad) our current machine learning model is.
• It is a function of our prediction ŷ and the ground truth label y: L(ŷ, y).
• Depending on how ŷ and y differ, it outputs some positive number to penalize the model.
  • If ŷ = y, we do not want to penalize the model, so the loss should be (around) 0.
  • If ŷ ≈ y, we might want to penalize the model a little, to fine-tune it.
  • If the gap between ŷ and y is huge, we should penalize the model heavily.
• So, how do we measure how different ŷ and y are?

Loss Function: Probabilistic Setting
• For a binary classification problem, the ground truth is y ∈ {0, 1}.
  • Our model predicts one score ŷ, the probability of one class.
  • The other class's probability is 1 - ŷ.
  • Taking the sigmoid of the score difference is an example.
• This naturally extends to K > 2 classes:
  • The ground truth is represented as a one-hot vector: y = [0, 0, 0, 0, 1, 0, 0]
  • The model predicts K - 1 scores, leaving the last one as 1 - sum.
  • The softmax function is an example: predictions are between 0 and 1, and sum to 1.
• In this setting, the loss function basically compares two probability distributions: the ground truth (GT) and the predicted one.

Cross Entropy
● From information theory: cross entropy measures the average number of bits needed to identify an event drawn from the set.
  ○ General definition: H(y, ŷ) = -Σk yk log ŷk
  ○ Binary classification: L = -[ y log ŷ + (1 - y) log(1 - ŷ) ]

Cross Entropy
● Observation: Even though the sum runs over all K classes, only one term survives, since yik = 1 for only one k: the ground truth index Ti for example i.

  L = -Σi Σk yik log ŷik = -Σi log ŷi,Ti

→ This is the sum of -log("our predicted probability for the true class") over all samples.
● Since "our predicted probability" is between 0 and 1, the loss lies on the curve y = -log x restricted to 0 < x ≤ 1.
→ If our estimate gets closer to 1, the loss converges to 0. If our estimate gets closer to 0, the loss increases.
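A compact sketch of this loss in code, assuming the predictions are given as an (N, K) array of class probabilities and the ground truth as integer class indices Ti; the small epsilon guarding against log(0) is an addition beyond the slide:

import numpy as np

def cross_entropy_loss(probs, true_idx, eps=1e-12):
    # Sum of -log(predicted probability of the true class) over all samples
    # (in practice this is often averaged over N instead of summed).
    n = probs.shape[0]
    true_class_probs = probs[np.arange(n), true_idx]   # pick p_{i, T_i} for each sample i
    return -np.sum(np.log(true_class_probs + eps))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
print(cross_entropy_loss(probs, np.array([0, 2])))   # confident and correct -> ~0.58
print(cross_entropy_loss(probs, np.array([1, 0])))   # wrong classes -> ~3.91, much larger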

Optimization

How to find the Weights?
We have not found an answer for the biggest question yet:
How to set the values in W (the parameters)?
• Machine learning is a data-driven approach.
• We (humans) just design the form of our model (e.g., f(x,W) = Wx), and
• initialize the parameter values (W) randomly.
• Then, feed a training example x to estimate the label ŷ.
• Compare our estimate ŷ to the ground truth label y, estimating how good/bad we currently are. → Loss Functions
• Update the parameters (W) based on this loss. → Optimization
• Repeat this until we get ŷ ≈ y.

Optimization?
[Wikipedia] Mathematical optimization or mathematical programming is
the selection of a best element (with regard to some criterion) from some set
of available alternatives.





Primitive Ideas
• Exhaustive search
• Random search
• Visualization
• Greedy search (solving each variable one by one)
• ....



Following the slope
• We have a cost function J(Θ) that we want to minimize.
• Idea: for the current value of Θ, calculate the gradient of J(Θ), then take a small step in the direction of the negative gradient. Repeat.
• Recall that the gradient (or derivative) of a function at a point gives the best linear approximation to the function at that point.



Gradient Descent
• Update equation (in matrix notation): Θ ← Θ - α ∇Θ J(Θ)

• Update equation (for a single parameter): θj ← θj - α ∂J(Θ)/∂θj

(Figure: J(Θ) plotted against Θ; the slope is negative to the left of the minimum and positive to the right, so a step against the slope always moves toward the minimum.)

import numpy as np

theta = np.random.randn(num_params)                  # random initialization
while True:
    theta_grad = evaluate_gradient(J, data, theta)   # gradient of the cost at the current theta
    theta = theta - alpha * theta_grad               # take a small step against the gradient
    if np.linalg.norm(theta_grad) <= beta:           # stop once the gradient is (nearly) zero
        break



Gradient Descent: Potential Problems
• The surface may not be convex.
  • In other words, there may be many local optima or saddle points.
  • There, the gradient will be 0 even though it is not the global minimum.
• It works only if the cost function is differentiable.
  • Some loss functions (e.g., the 0/1 loss) are not differentiable everywhere.
  • In that case, a smooth surrogate function is adopted instead.
• Converging to a (local) minimum can be quite slow.
  • The gradient is computed for all data samples, and then its average is used to update the parameters.
  • With a large dataset, computing the gradients for all data points takes a long time.



Stochastic Gradient Descent
• Instead of computing gradients for all training examples, do it only on a randomly sampled subset.
  • The most extreme case is updating the parameters for every single sample.
  • The most common way is in the middle: use a minibatch of size 32, 64, 128, 256, …, 8192.
• The optimal size depends on the problem, the data, and the hardware (memory).
  • When the minibatch is small, doubling its size stabilizes the gradient estimate a lot. As it gets larger, we gain less while the cost still doubles (diminishing returns).
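A minimal sketch of a minibatch SGD loop in the style of the earlier gradient-descent pseudocode (the names evaluate_gradient, J, data, num_params and the hyperparameter values are assumptions for illustration):

import numpy as np

batch_size = 128                 # somewhere in the middle of 32 ... 8192
alpha = 1e-3                     # learning rate (step size)
num_steps = 10000

theta = np.random.randn(num_params)                     # random initialization
for step in range(num_steps):
    idx = np.random.choice(len(data), size=batch_size, replace=False)
    minibatch = data[idx]                               # randomly sampled subset
    theta_grad = evaluate_gradient(J, minibatch, theta) # cheaper but noisier gradient estimate
    theta = theta - alpha * theta_grad                  # same update rule as before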



ML-based Approach
Now we have all the components, including how to set the values in W (the parameters)!
• Machine learning is a data-driven approach.
• We (humans) just design the form of our model (e.g., f(x,W) = Wx), and
• initialize the parameter values (W) randomly.
• Then, feed a training example x to estimate the label ŷ.
• Compare our estimate ŷ to the ground truth label y, estimating how good/bad we currently are.
• Update the parameters (W) based on this loss.
• Repeat this until we get ŷ ≈ y.
From each iteration, the loss is expected to decrease. After enough iterations, the stopping criterion (ŷ ≈ y) will be satisfied. A minimal end-to-end sketch of this loop is given below.
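Putting the pieces together, here is a rough end-to-end sketch of training the linear classifier with the softmax/cross-entropy loss and minibatch SGD. The hyperparameters and the closed-form gradient of the softmax cross-entropy are standard choices assumed for illustration; they are not spelled out on the slides:

import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def train_linear_classifier(X, y, num_classes, alpha=1e-3, batch_size=128, num_steps=5000):
    # X: (N, D) inputs with a trailing 1 appended (bias trick); y: (N,) class indices.
    N, D = X.shape
    W = 0.001 * np.random.randn(num_classes, D)             # random initialization

    for step in range(num_steps):
        idx = np.random.choice(N, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]

        probs = softmax(Xb @ W.T)                            # (B, K) predicted probabilities
        loss = -np.mean(np.log(probs[np.arange(len(yb)), yb] + 1e-12))

        dscores = probs.copy()
        dscores[np.arange(len(yb)), yb] -= 1.0               # gradient w.r.t. scores: p - y(one-hot)
        dW = dscores.T @ Xb / len(yb)                        # gradient w.r.t. W

        W -= alpha * dW                                      # SGD update
        if step % 1000 == 0:
            print(step, loss)                                # the loss should trend downward
    return W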

An Example of Training Loss
Q: Why do we see this noisy curve?
Q: When should we stop training?



ML-based Approach
(More advanced ML courses will mostly focus on the modeling step for various problems.)
• Machine learning is a data-driven approach.
• We (humans) just design the form of our model (e.g., f(x,W) = Wx), and
• initialize the parameter values (W) randomly.
• Then, feed a training example x to estimate the label ŷ.
• Compare our estimate ŷ to the ground truth label y, estimating how good/bad we currently are.
• Update the parameters (W) based on this loss.
• Repeat this until we get ŷ ≈ y.
We are done with this part now!

