Lecture 22 - Loss Functions and Optimization
Linear Classifier
Parametric Approach
Instead of memorizing all training examples, think of a function f that maps the input (image x) to the label scores (class y).
[Diagram: Image x → f(x) → Label y]
Parametric Approach: Linear Classifier
Q: What should be the form of this function f?
A: There can be many ways, but let's start with a simple one: a linear function!
→ weighted sum of input pixels: W1,1x1,1 + W1,2x1,2 + … + WM,NxM,N
Parametric Approach: Linear Classifier
We need to determine the values in W such that, for each image, the correct class gets the highest score.
For this example, the class ‘cat’ should have the highest score.
[Diagram: Image x → f(x, W) = Wx → Label y, with weights (parameters) W]
Parametric Approach: Linear Classifier
Let's check the dimensionality of each variable:
• Image x: 32 × 32 × 3 (= 3072 numbers)
• Label y = f(x, W): 10 × 1 (10 independent classifiers f, one for each class)
• Weights W: ? × ?
Parametric Approach: Linear Classifier
Let's check the dimensionality of each variable:
f(x, W) = Wx
• Image x: 3072 × 1 (32 × 32 × 3 = 3072 numbers)
• Weights (parameters) W: 10 × 3072 (10 independent classifiers f, one for each class)
• Output y: (10 × 3072)(3072 × 1) = 10 × 1
Parametric Approach: Linear Classifier
Let's check the dimensionality of each variable:
f(x, W) = Wx + b
• Image x: 3072 × 1 (32 × 32 × 3 = 3072 numbers)
• Weights (parameters) W: 10 × 3072 (10 independent classifiers f, one for each class)
• Bias b: 10 × 1, affecting the output without interacting with the data x
• Output y: 10 × 1
Parametric Approach: Linear Classifier
A separate variable b makes things complicated...
The bias trick: fold b into W as an extra column, and append a constant 1 to x:
f(x, W) = Wx + b = [W b] [x; 1]
where Wx is (10 × 3072)(3072 × 1) = 10 × 1, b is 10 × 1, [W b] is 10 × 3073, and [x; 1] is 3073 × 1.
Let's call [W b] = W' and [x; 1] = x'. Thus f(x, W) = W'x', and we may omit the primes (').
Parametric Approach: Linear Classifier
So, this is our final linear classifier model:
f(x,W) = Wx
Advantages of parametric models:
• Once trained, all we need to keep is the weights W; we do not need to store the huge training dataset. → space efficient
• At test time, an example can be evaluated with a single matrix-vector multiplication (Wx). → much faster than comparing against all training data (a minimal sketch follows)
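
A minimal sketch of this model in code (NumPy; the random weights stand in for trained ones, and the dimensions follow the CIFAR-10-style setup above):

import numpy as np

# 10 classes, 32x32x3 images flattened to 3072 numbers (+1 column for the bias trick).
num_classes, num_pixels = 10, 32 * 32 * 3

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(num_classes, num_pixels + 1))  # W' = [W b]
x = rng.random(num_pixels)                                      # stand-in for a real image

x_prime = np.append(x, 1.0)               # x' = [x; 1], the bias trick
scores = W @ x_prime                      # a single matrix-vector multiplication
predicted_class = int(np.argmax(scores))  # index of the highest-scoring class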
Illustration
Example with a 4-pixel image and 3 classes (cat/dog/ship), worked in code below.
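
A worked sketch of this setting in code. The numbers below are hypothetical illustration values (borrowed from the well-known CS231n example; they reproduce the score 437.9 quoted on a later slide):

import numpy as np

# 3 classes (cat/dog/ship), one 4-pixel image, flattened.
W = np.array([[0.2, -0.5,  0.1,  2.0],    # 'cat' classifier (row 1 of W)
              [1.5,  1.3,  2.1,  0.0],    # 'dog' classifier (row 2 of W)
              [0.0,  0.25, 0.2, -0.3]])   # 'ship' classifier (row 3 of W)
b = np.array([1.1, 3.2, -1.2])            # one bias per class
x = np.array([56.0, 231.0, 24.0, 2.0])    # pixel intensities

scores = W @ x + b   # -> [-96.8, 437.9, 60.75]: one score per class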
Interpreting a Linear Classifier: Geometric Viewpoint
• Each linear boundary comes from the corresponding row of W.
Interpreting a Linear Classifier: Visual Viewpoint
What the linear classifier does:
● At training: learn a template for each class from the training data.
● At testing: perform template matching against a new example.
Limitations of Linear Classifier
What is the meaning of the score?
• It is unbounded: it can be arbitrarily large.
• Hard to interpret: what is the meaning of 437.9? How good is it?
Limitations of Linear Classifier
What is the meaning of the score?
• Let's define how confident we are that this image belongs to each class.
• For 2 classes: if s1 > s2, we believe this image is in class 1 more than in class 2.
• The larger the gap (s1 - s2) is, the more confident we are that x ∈ c1.
• The other way around also holds: the larger s2 - s1 is, the more likely x ∈ c2.
Sigmoid Function
Given the score difference s (= s1 - s2), we want a function that maps
• larger s closer to 1,
• smaller (large negative) s closer to 0,
• 0 to 0.5.
We may use the sigmoid function σ(s) = 1 / (1 + e^(-s)) as a bounded score for a class!
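
A minimal sketch of this behavior in code:

import math

def sigmoid(s):
    # sigma(s) = 1 / (1 + e^(-s)): maps any real score difference into (0, 1)
    return 1.0 / (1.0 + math.exp(-s))

print(sigmoid(10.0))   # ~0.99995: large s maps close to 1
print(sigmoid(-10.0))  # ~0.00005: large negative s maps close to 0
print(sigmoid(0.0))    # 0.5: equal scores, no preference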
Softmax Classifier
Define p(y=c1|x) = σ(s1 - s2) and p(y=c2|x) = σ(s2 - s1). Let's check if this is really a probability:
1) 0 ≤ p(y=cn|x) ≤ 1 (for n = 1, 2)
2) p(y=c1|x) + p(y=c2|x) = 1
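
A quick numeric check of both properties, using the definitions above (the scores s1 and s2 are hypothetical):

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

s1, s2 = 3.2, -1.7                            # hypothetical class scores
p1 = sigmoid(s1 - s2)                         # p(y=c1|x)
p2 = sigmoid(s2 - s1)                         # p(y=c2|x) = 1 - p1, since sigmoid(-s) = 1 - sigmoid(s)
assert 0.0 <= p1 <= 1.0 and 0.0 <= p2 <= 1.0  # property 1
assert abs(p1 + p2 - 1.0) < 1e-12             # property 2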
Softmax Classifier
Coming back to this example:
[Figure: the softmax probabilities for the three classes are roughly 7e-260, ≈1, and 5e-122 — the top-scoring class receives essentially all of the probability mass.]
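
A sketch of the same idea in code: softmax over the three class scores from the earlier 4-pixel example (the exact probabilities shown in the slide's figure may come from a different set of scores):

import numpy as np

scores = np.array([-96.8, 437.9, 60.75])    # cat, dog, ship
exp_scores = np.exp(scores - scores.max())  # subtract the max for numerical stability
probs = exp_scores / exp_scores.sum()       # softmax: non-negative, sums to 1
# probs ~ [0, 1, 0]: the top-scoring class takes essentially all the probability mass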
How to find the Weights?
We have not found an answer to the biggest question yet:
How to set the values in W (the parameters)?
• Machine learning is a data-driven approach.
• We (humans) just design the form of our model (e.g., f(x, W) = Wx), and
• initialize the parameter values (W) randomly.
• Then, feed a training example x to estimate the label y^.
• Compare our estimate y^ to the ground-truth label y, measuring how good/bad we currently are. → Loss Functions
• Update the parameters (W) based on this loss. → Optimization
• Repeat until y^ ≈ y. (A minimal sketch of this loop follows.)
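
A minimal sketch of this loop (training_data, loss_fn, grad_fn, and the learning rate lr are hypothetical placeholders, not defined in the slides):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(10, 3072))  # initialize the parameters randomly
for x, y in training_data:
    y_hat = W @ x                            # feed a training example, estimate label scores
    loss = loss_fn(y_hat, y)                 # loss function: how good/bad are we currently?
    W = W - lr * grad_fn(W, x, y)            # optimization: update W based on this loss
# repeat passes over the data until y_hat ~= y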
Loss Functions
Loss Function
A loss function quantifies how good (or bad) our current machine learning model is.
• It is a function of our prediction y^ and the ground-truth label y: L(y^, y).
• Depending on how different y^ and y are, it outputs some positive number to penalize the model.
• If y^ = y, we do not want to penalize the model, so the loss should be (around) 0.
• If y^ ≈ y, we might want to penalize the model a little, to fine-tune it.
• If the gap between y^ and y is huge, we should penalize it heavily.
• So, how much do y^ and y differ? (A toy example follows.)
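
As a toy worked example of these requirements, take the squared error L(y^, y) = (y^ - y)² (one possible choice, not the one used later):
• y^ = y → L = 0 (no penalty)
• y^ = y + 0.1 → L = 0.01 (small penalty, for fine-tuning)
• y^ = y + 10 → L = 100 (heavy penalty)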
Loss Function: Probabilistic Setting
• For a binary classification problem, the ground truth is y ∈ {0, 1}.
• Our model predicts one score y^, the probability of one class.
• The other class's probability is 1 - y^.
• Taking the sigmoid of the score difference is one example.
• In this setting, the loss function basically compares two probability distributions: the ground truth and the prediction.
Cross Entropy
● From information theory: measures the average number of bits needed to identify an event drawn from the set.
○ General definition: H(p, q) = - Σk pk log qk
○ Binary classification: L = - [ y log y^ + (1 - y) log(1 - y^) ]
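
A minimal implementation sketch of the binary case (the clipping constant eps is an added safeguard against log(0)):

import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # L = -[ y*log(y_hat) + (1 - y)*log(1 - y_hat) ], with y in {0, 1}
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

print(binary_cross_entropy(1.0, 0.99))  # ~0.01: confident and correct -> tiny loss
print(binary_cross_entropy(1.0, 0.01))  # ~4.61: confident and wrong -> large loss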
Cross Entropy
● Observation: even though the loss sums over all K classes, only one term survives, since yik = 1 for exactly one k:
Li = - Σk=1..K yik log y^ik = - log y^i,Ti
(Ti: the ground-truth class index for example i)
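
A short sketch that makes this observation concrete (hypothetical predicted distribution over K = 3 classes):

import numpy as np

y_hat_i = np.array([0.1, 0.7, 0.2])  # predicted class probabilities for example i
y_i = np.array([0.0, 1.0, 0.0])      # one-hot ground truth: T_i = 1

full_sum = -np.sum(y_i * np.log(y_hat_i))  # the sum over all K classes
one_term = -np.log(y_hat_i[1])             # only the k = T_i term survives
assert np.isclose(full_sum, one_term)      # both equal -log(0.7) ~ 0.357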
Optimization
How to find the Weights?
We have not found an answer to the biggest question yet:
How to set the values in W (the parameters)?
• Machine learning is a data-driven approach.
• We (humans) just design the form of our model (e.g., f(x, W) = Wx), and
• initialize the parameter values (W) randomly.
• Then, feed a training example x to estimate the label y^.
• Compare our estimate y^ to the ground-truth label y, measuring how good/bad we currently are. → Loss Functions
• Update the parameters (W) based on this loss. → Optimization
• Repeat until y^ ≈ y.
Optimization?
[Wikipedia] Mathematical optimization or mathematical programming is the selection of a best element (with regard to some criterion) from some set of available alternatives.
[Plot: loss J(Θ) versus Θ — Slope < 0 to the left of the minimum, Slope > 0 to the right.]
import numpy as np

theta = np.random.rand(num_params)                  # random initialization (num_params: model size)
while True:
    theta_grad = evaluate_gradient(J, data, theta)  # gradient of the loss J at the current theta
    theta = theta - alpha * theta_grad              # step against the gradient (alpha: learning rate)
    if np.linalg.norm(theta_grad) <= beta:          # stop when the gradient is small enough
        break
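
To make the loop concrete, a runnable sketch on a toy one-dimensional loss J(θ) = (θ - 3)², whose gradient is 2(θ - 3); the alpha and beta values are arbitrary choices:

import numpy as np

def evaluate_gradient(theta):
    return 2.0 * (theta - 3.0)   # dJ/dtheta for J(theta) = (theta - 3)^2

alpha, beta = 0.1, 1e-6          # learning rate and stopping threshold
theta = np.random.rand()         # random initialization
while True:
    theta_grad = evaluate_gradient(theta)
    theta = theta - alpha * theta_grad
    if abs(theta_grad) <= beta:  # stop when the slope is (almost) flat
        break
# theta ~= 3.0, the minimizer of J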
An Example of Training Loss
[Figure: a training loss curve — noisy, but decreasing over iterations.]
Q: Why do we see this noisy curve?
Q: When to stop training?