
Principles and Application of Data Science

Loss Functions & Optimization


Lecture 22

SNU Graduate School of Data Science


Review

Linear Classifier

Parametric Approach
Instead of memorizing all training examples,
think of a function f that maps the input (image x) to the label scores (class y):
y = f(x)

Parametric Approach: Linear Classifier
Q: What should be the form of this function f?
A: There can be many ways, but let's start with a simple one: a linear function!
→ a weighted sum of the input pixels of image x, with a weight (parameter) in W corresponding to each pixel

Parametric Approach: Linear Classifier
Q: What should be the form of this function f?
A: There can be many ways, but let's start with a simple one: a linear function!
→ weighted sum of input pixels: W1,1x1,1 + W1,2x1,2 + … + WM,NxM,N

This weighted sum of the image x with the weights (parameters) W is the score for class airplane (label y). We make 10 different W, one for each class, in the same way.
Parametric Approach: Linear Classifier
We need to determine the values in W such that the correct class gets the highest score for each image.
For this example (image x, label y), class 'cat' should have the highest score among the outputs of f(x,W) = Wx, where W are the weights or parameters.

Parametric Approach: Linear Classifier
Let's check the dimensionality of each variable:
• Image x: 32 × 32 × 3 (= 3072 numbers)
• Label scores y = f(x,W): 10 numbers, coming from 10 independent classifiers f (one for each class)
Parametric Approach: Linear Classifier
Let's check the dimensionality of each variable, for f(x,W) = Wx:
• Image x: 3072 × 1 (32 × 32 × 3 = 3072 numbers)
• Weights or parameters W: 10 × 3072 (10 independent classifiers f, one for each class)
• Label scores y = f(x,W): 10 × 1
Parametric Approach: Linear Classifier
Let's check the dimensionality of each variable, for f(x,W) = Wx + b:
• Image x: 3072 × 1 (32 × 32 × 3 = 3072 numbers)
• Weights or parameters W: 10 × 3072 (10 independent classifiers f, one for each class)
• Bias b: 10 × 1, affecting the output without interacting with the data x
• Label scores y = f(x,W): 10 × 1

Parametric Approach: Linear Classifier
A separate variable b makes things complicated...

f(x,W) = Wx + b = [W b] [x; 1]
(10 × 1)   (10 × 3072)(3072 × 1) + (10 × 1)   (10 × 3073)(3073 × 1)

Let's call [W b] = W' and [x; 1] = x'.
Thus f(x,W) = W'x', and we may omit the primes (').
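As a small illustration of this bias trick, here is a minimal NumPy sketch; only the shapes follow the slide, and the random values are placeholders:

import numpy as np

x = np.random.rand(3072)                 # flattened 32x32x3 image
W = np.random.randn(10, 3072)            # one row of weights per class
b = np.random.randn(10)                  # one bias per class

scores = W @ x + b                       # original form: Wx + b, shape (10,)

W_prime = np.hstack([W, b[:, None]])     # append b as an extra column of W: shape (10, 3073)
x_prime = np.append(x, 1.0)              # append a constant 1 to x: shape (3073,)
scores_prime = W_prime @ x_prime         # a single matrix-vector product

assert np.allclose(scores, scores_prime) # identical scores, with b folded into W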
Parametric Approach: Linear Classifier
So, this is our final linear classifier model:

f(x,W) = Wx
Advantages of parametric models:
• Once trained, what we need is the weights W only. We do not need to store
the huge training dataset. → space efficient
• At test time, an example can be evaluated by a single matrix-vector
multiplication (Wx). → much faster than comparing against all training data.

Illustration
Example with an image with 4 pixels and 3 classes (cat/dog/ship)
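The slide's figure is not reproduced here; the sketch below redoes the same kind of computation in NumPy. The pixel values and weights are illustrative choices (the dog score happens to come out to 437.9, the value discussed on a later slide), not necessarily the exact numbers in the figure:

import numpy as np

x = np.array([56.0, 231.0, 24.0, 2.0])        # 4 pixel values, flattened into a vector

W = np.array([[ 0.2, -0.5,  0.1,  2.0 ],      # weights for class 'cat'
              [ 1.5,  1.3,  2.1,  0.0 ],      # weights for class 'dog'
              [ 0.0,  0.25, 0.2, -0.3 ]])     # weights for class 'ship'
b = np.array([1.1, 3.2, -1.2])                # one bias per class

scores = W @ x + b
print(scores)         # [-96.8, 437.9, 60.75] -> 'dog' gets the highest score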

Interpreting a Linear Classifier: Geometric Viewpoint
• Each linear decision boundary comes from the corresponding row of W.
• If the values in W change, the corresponding decision boundary will rotate.
• With a change in b, the corresponding decision boundary will shift up/down.

Interpreting a Linear Classifier: Visual Viewpoint

Interpreting a Linear Classifier: Visual Viewpoint
What the linear classifier does is
● At training: learning a template for each class from the training data
● At testing: performing template matching with a new example

It is similar to kNN in that both compare distances, but different in that the linear classifier compares only against K class templates, while kNN compares against all N training examples (usually, K << N).

Limitations of Linear Classifier
What is the meaning of the score?
• It is unbounded: it can be arbitrarily large.
• It is hard to interpret: what is the meaning of 437.9? How good is it?
• It would be nicer to get a bounded score between 0 and 1, so that we can interpret it as a probability.
Limitations of Linear Classifier
What is the meaning of the score?
• Let's define how confident we are that this image belongs to each class.
• For 2 classes: if s1 > s2, we believe this image is in class 1 more than in class 2.
• The larger the gap (s1 - s2) is, the more confident we are that x ∈ c1.
• Similarly, the other way around also holds: the larger s2 - s1 is, the more likely x ∈ c2.

Sigmoid Function
Given the score difference s (= s1 - s2), we want a function that maps
• larger s closer to 1,
• smaller (large negative) s closer to 0,
• 0 to 0.5.
We may use the sigmoid function σ(s) = 1 / (1 + e^(-s)) as a bounded score for a class!
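A quick numeric check of this mapping (a minimal sketch, not part of the original slide):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))   # maps any real score difference into (0, 1)

for s in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"s = {s:6.1f} -> sigma(s) = {sigmoid(s):.4f}")
# large negative s -> close to 0; s = 0 -> 0.5; large positive s -> close to 1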

Softmax Classifier
Let's check if this is really a probability:
1) 0 ≤ p(y=cn|x) ≤ 1 (for n = 1, 2)
2) p(y=c1|x) + p(y=c2|x) = 1

Generalizing this to K > 2 classes:

p(y=ck|x) = exp(sk) / Σj exp(sj)    (the Softmax Function)
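A small sketch of the softmax function in code (the max-subtraction step is a standard numerical-stability trick added here, not something stated on the slide):

import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # subtract the max so exp() does not overflow
    e = np.exp(shifted)
    return e / e.sum()                  # probabilities: nonnegative and summing to 1

scores = np.array([-96.8, 437.9, 60.75])   # e.g., the cat/dog/ship scores from before
print(softmax(scores))                     # 'dog' gets essentially all the probability mass
print(softmax(scores).sum())               # 1.0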

Softmax Classifier
Coming back to this example:
• From the estimated probabilities, our softmax classifier chooses class 2 (dog) with almost 100% confidence.
• The estimated probabilities for the three classes are roughly 7e-260, ≈ 1, and 5e-122.

How to find the Weights?
We have not found an answer for the biggest question yet:
How to set the values in W (the parameters)?
• Machine learning is a data-driven approach.
• We (humans) just design the form of our model (e.g., f(x,W) = Wx), and
• initialize the parameter values (W) randomly.
• Then, feed a training example x to estimate the label ŷ.
• Compare our estimate ŷ to the ground truth label y, estimating how good/bad we currently are. → Loss Functions
• Update the parameters (W) based on this loss. → Optimization
• Repeat this until we get ŷ ≈ y.

Loss Functions

Loss Function
A loss function quantifies how good (or bad) our current machine learning model is.
• It is a function of our prediction ŷ and the ground truth label y: L(ŷ, y).
• Depending on how ŷ and y differ, it outputs some positive number to penalize the model.
  • If ŷ = y, we do not want to penalize the model, so the loss should be (around) 0.
  • If ŷ ≈ y, we might want to penalize the model a little, to fine-tune it.
  • If the gap between ŷ and y is huge, we should penalize the model heavily.
• So, how do we measure how different ŷ and y are?

Loss Function: Probabilistic Setting
• For a binary classification problem, the ground truth is y ∈ {0, 1}.
  • Our model predicts one score ŷ, the probability of one class.
  • The other class's probability is 1 - ŷ.
  • Taking the sigmoid of the score difference is an example.
• This naturally extends to K > 2 classes:
  • The ground truth is represented as a one-hot vector: y = [0, 0, 0, 0, 1, 0, 0]
  • The model predicts K - 1 scores, leaving the last one as 1 - sum.
  • The softmax function is an example: predictions are between 0 and 1, and sum to 1.
• In this setting, the loss function basically compares two probability distributions: the ground truth (GT) and the predicted one.

Cross Entropy
● From information theory: cross entropy measures the average number of bits needed to identify an event drawn from the set.
  ○ General definition: H(y, ŷ) = -Σk yk log ŷk
  ○ Binary classification: L = -[ y log ŷ + (1 - y) log(1 - ŷ) ]

Cross Entropy
● Observation: Even though the sum runs over all K classes, only one term survives, since yik = 1 for only one k: the ground truth index Ti for example i.

  L = -Σi Σk yik log ŷik = -Σi log ŷi,Ti

→ This is the sum of -log("our predicted probability for the true class") over all samples.
● Since "our predicted probability" is between 0 and 1, the loss lies on the curve y = -log x restricted to 0 < x ≤ 1.
→ If our estimate gets closer to 1, the loss converges to 0. If our estimate gets closer to 0, the loss increases.
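A compact sketch of this loss in code, assuming the predictions are given as an (N, K) array of class probabilities and the ground truth as integer class indices Ti; the small epsilon guarding against log(0) is an addition beyond the slide:

import numpy as np

def cross_entropy_loss(probs, true_idx, eps=1e-12):
    # Sum of -log(predicted probability of the true class) over all samples
    # (in practice this is often averaged over N instead of summed).
    n = probs.shape[0]
    true_class_probs = probs[np.arange(n), true_idx]   # pick p_{i, T_i} for each sample i
    return -np.sum(np.log(true_class_probs + eps))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
print(cross_entropy_loss(probs, np.array([0, 2])))   # confident and correct -> ~0.58
print(cross_entropy_loss(probs, np.array([1, 0])))   # wrong classes -> ~3.91, much larger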

Optimization

How to find the Weights?
We have not found an answer for the biggest question yet:
How to set the values in W (the parameters)?
• Machine learning is a data-driven approach.
• We (humans) just design the form of our model (e.g., f(x,W) = Wx), and
• initialize the parameter values (W) randomly.
• Then, feed a training example x to estimate the label ŷ.
• Compare our estimate ŷ to the ground truth label y, estimating how good/bad we currently are. → Loss Functions
• Update the parameters (W) based on this loss. → Optimization
• Repeat this until we get ŷ ≈ y.

Optimization?
[Wikipedia] Mathematical optimization or mathematical programming is
the selection of a best element (with regard to some criterion) from some set
of available alternatives.





Primitive Ideas
• Exhaustive search
• Random search
• Visualization
• Greedy search (solving each variable one by one)
• ....



Following the slope
• We have a cost function J(Θ) that we want to minimize.
• Idea: for the current value of Θ, calculate the gradient of J(Θ), then take a small step in the direction of the negative gradient. Repeat.
• Recall that the gradient (or derivative) of a function at a point gives the best linear approximation to the function at that point.



Gradient Descent
• Update equation (in matrix notation): Θ ← Θ - α ∇Θ J(Θ)

• Update equation (for a single parameter): θj ← θj - α ∂J(Θ)/∂θj

(Figure: J(Θ) plotted against Θ; the slope is negative to the left of the minimum and positive to the right, so a step against the slope always moves toward the minimum.)

import numpy as np

theta = np.random.randn(num_params)                  # random initialization
while True:
    theta_grad = evaluate_gradient(J, data, theta)   # gradient of the cost at the current theta
    theta = theta - alpha * theta_grad               # take a small step against the gradient
    if np.linalg.norm(theta_grad) <= beta:           # stop once the gradient is (nearly) zero
        break



Gradient Descent: Potential Problems
• The surface may not be convex.
  • In other words, there may be many local optima or saddle points.
  • There, the gradient will be 0 even though it is not the global minimum.
• It works only if the cost function is differentiable.
  • Some loss functions (e.g., the 0/1 loss) are not differentiable everywhere.
  • In that case, a smooth surrogate function is adopted instead.
• Converging to a (local) minimum can be quite slow.
  • The gradient is computed for all data samples, and then its average is used to update the parameters.
  • With a large dataset, computing the gradients for all data points takes a long time.



Stochastic Gradient Descent
• Instead of computing gradients for all training examples, do it only on a randomly sampled subset.
  • The most extreme case is updating the parameters for every single sample.
  • The most common way is in the middle: use a minibatch of size 32, 64, 128, 256, …, 8192.
• The optimal size depends on the problem, the data, and the hardware (memory).
  • When the minibatch is small, doubling its size stabilizes the gradient estimate a lot. As it gets larger, we gain less while the cost still doubles (diminishing returns).
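A minimal sketch of a minibatch SGD loop in the style of the earlier gradient-descent pseudocode (the names evaluate_gradient, J, data, num_params and the hyperparameter values are assumptions for illustration):

import numpy as np

batch_size = 128                 # somewhere in the middle of 32 ... 8192
alpha = 1e-3                     # learning rate (step size)
num_steps = 10000

theta = np.random.randn(num_params)                     # random initialization
for step in range(num_steps):
    idx = np.random.choice(len(data), size=batch_size, replace=False)
    minibatch = data[idx]                               # randomly sampled subset
    theta_grad = evaluate_gradient(J, minibatch, theta) # cheaper but noisier gradient estimate
    theta = theta - alpha * theta_grad                  # same update rule as before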



ML-based Approach
Now we have all the components, including how to set the values in W (the parameters)!
• Machine learning is a data-driven approach.
• We (humans) just design the form of our model (e.g., f(x,W) = Wx), and
• initialize the parameter values (W) randomly.
• Then, feed a training example x to estimate the label ŷ.
• Compare our estimate ŷ to the ground truth label y, estimating how good/bad we currently are.
• Update the parameters (W) based on this loss.
• Repeat this until we get ŷ ≈ y.
From each iteration, the loss is expected to decrease. After enough iterations, the stopping criterion (ŷ ≈ y) will be satisfied. A minimal end-to-end sketch of this loop is given below.
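Putting the pieces together, here is a rough end-to-end sketch of training the linear classifier with the softmax/cross-entropy loss and minibatch SGD. The hyperparameters and the closed-form gradient of the softmax cross-entropy are standard choices assumed for illustration; they are not spelled out on the slides:

import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def train_linear_classifier(X, y, num_classes, alpha=1e-3, batch_size=128, num_steps=5000):
    # X: (N, D) inputs with a trailing 1 appended (bias trick); y: (N,) class indices.
    N, D = X.shape
    W = 0.001 * np.random.randn(num_classes, D)             # random initialization

    for step in range(num_steps):
        idx = np.random.choice(N, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]

        probs = softmax(Xb @ W.T)                            # (B, K) predicted probabilities
        loss = -np.mean(np.log(probs[np.arange(len(yb)), yb] + 1e-12))

        dscores = probs.copy()
        dscores[np.arange(len(yb)), yb] -= 1.0               # gradient w.r.t. scores: p - y(one-hot)
        dW = dscores.T @ Xb / len(yb)                        # gradient w.r.t. W

        W -= alpha * dW                                      # SGD update
        if step % 1000 == 0:
            print(step, loss)                                # the loss should trend downward
    return W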

An Example of Training Loss
Q: Why do we see this noisy curve?
Q: When should we stop training?



ML-based Approach
(More advanced ML courses will mostly focus on the modeling step for various problems.)
• Machine learning is a data-driven approach.
• We (humans) just design the form of our model (e.g., f(x,W) = Wx), and
• initialize the parameter values (W) randomly.
• Then, feed a training example x to estimate the label ŷ.
• Compare our estimate ŷ to the ground truth label y, estimating how good/bad we currently are.
• Update the parameters (W) based on this loss.
• Repeat this until we get ŷ ≈ y.
We are done with this part now!

