DD2424
Decision Boundary
g: R^d × R^p → {−1, 1}

where R^d is the input space and R^p is the parameter space.
Usually
• Have to decide on
1. Form of f (a hyperplane?) and
2. How to estimate f ’s parameters, θ̂, from D.
Learn decision boundary discriminatively

θ̂ = arg min_θ  Σ_{(x,y)∈D} l(y, f(x; θ))  +  λ R(θ)

(the sum is the training error; λ R(θ) is the regularization term)
where
- l(y, f(x; θ)) is the loss function and measures how well (and
robustly) f(x; θ) predicts the label y.
- The training error term measures how well and robustly the
function f (· ; θ) predicts the labels over all the training data.
- The regularization term measures the complexity of the
function f (· ; θ).
Usually want to learn simpler functions =⇒ less risk of over-fitting.
Comment on Over- and Under-fitting
Example of Over- and Under-fitting

[Figure: prediction error vs. model complexity (low to high) for the training sample and a test sample. Training error decreases monotonically with complexity, while test error is U-shaped.]
[Figure: the hyperplane w^T x + b = 0 in the (x1, x2) plane separates the half-space w^T x + b > 0 from the half-space w^T x + b < 0; the offset vector d from the origin to the hyperplane points along −w and has length ‖d‖ = |b| / ‖w‖.]

g(x; w, b) = { 1  if f(x; w, b) > 0
             { −1 if f(x; w, b) < 0
Pros & Cons of Linear classifiers
Pros
• Low variance classifier
• Easy to estimate.
  Frequently one can set up training so that the optimization
  problem is easy.
Cons
• High bias classifier
Often the decision boundary is not well-described by a linear
classifier.
How do we choose & learn the linear classifier?

The optimal separating hyperplane is the one that maximizes the margin, defined as the minimum distance of a training example to the decision surface. Therefore a larger margin gives better generalization capability.

[Figure: two linearly separable classes in the (x1, x2) plane and the optimal separating hyperplane.]
0-1 Loss function

For a single example (x, y) the 0-1 loss is defined as

l(y, f(x; θ)) = { 0 if y = sgn(f(x; θ))
                { 1 if y ≠ sgn(f(x; θ))

              = { 0 if y f(x; θ) > 0
                { 1 if y f(x; θ) < 0

(assuming y ∈ {−1, 1})

[Figure: the 0-1 loss plotted against y · f(x).]
L2 regularization

R(w) = ‖w‖² = Σ_{i=1}^d w_i²

L(D, w, b) = (1/2) Σ_{(x,y)∈D} l_sq(y, f(x; w, b))

           = (1/2) Σ_{(x,y)∈D} ((w^T x + b) − y)²   (squared error loss)
Matrix Calculus

∂(x^T a)/∂x = a                                  (1)
∂(a^T x)/∂x = a                                  (2)
∂(a^T X b)/∂X = a b^T                            (3)
∂(a^T X^T b)/∂X = b a^T                          (4)
∂(x^T B x)/∂x = (B + B^T) x                      (5)
∂(b^T X^T X c)/∂X = X(b c^T + c b^T)             (6)
∂(b^T X^T D X c)/∂X = D^T X b c^T + D X c b^T    (8)
End of Technical interlude
Pseudo-Inverse solution

w₁ = X† y

where

X† ≡ (X^T X)^{-1} X^T
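As a minimal numpy sketch of the pseudo-inverse solution (the data and labels below are made up for illustration), note that `np.linalg.pinv` computes the same X† more robustly via the SVD:

```python
import numpy as np

# Hypothetical small dataset: n = 5 points in d = 2 dimensions.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([1.0, -1.0, -1.0, 1.0, 1.0])

# Pseudo-inverse solution w1 = (X^T X)^{-1} X^T y.
w1 = np.linalg.inv(X.T @ X) @ X.T @ y

# np.linalg.pinv computes the same pseudo-inverse (via the SVD).
w1_pinv = np.linalg.pinv(X) @ y
```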
Line Search

Perform an intensive search to find the minimum along the
chosen direction.

Choosing a search direction: the gradient, i.e. moving in the
direction of steepest descent.
Gradient Descent Minimization

1. Start with an arbitrary solution x^(0).
2. Compute the gradient ∇_x f(x^(k)).
3. Move in the direction of steepest descent:

   x^(k+1) = x^(k) − η^(k) ∇_x f(x^(k)),

   where η^(k) is the step size.
4. Go to 2 (until convergence).

[Figure: contours of f over (x1, x2), showing the initial guess, a local minimum and the global minimum.]
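The four steps above can be sketched in a few lines of numpy; the objective f(x) = ‖x − c‖², the point c, the step size and the iteration count are all illustrative choices, not from the slides:

```python
import numpy as np

# Minimize f(x) = ||x - c||^2 with gradient descent.
c = np.array([1.0, -2.0])

def grad_f(x):
    # Gradient of f: 2 (x - c)
    return 2.0 * (x - c)

x = np.array([3.0, 4.0])   # 1. arbitrary initial solution x^(0)
eta = 0.1                  # constant step size eta^(k)
for _ in range(200):       # 4. iterate until (approximate) convergence
    g = grad_f(x)          # 2. compute the gradient
    x = x - eta * g        # 3. move in the direction of steepest descent
print(x)  # approaches c
```

Each step multiplies the error x − c by (1 − 2η), so for this convex quadratic the iterate contracts toward the unique minimum c.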
Gradient descent: Method for function minimization

The minima are among the points defined by the zeros of the gradient, ∇_x J(x) = 0. Gradient descent finds a minimum in an iterative fashion by moving in the direction of steepest descent.

Properties
1. Will converge to a local minimum.
2. The local minimum found depends on the initialization x^(0).

But this is okay
1. For convex optimization problems: local minimum ≡ global minimum.
2. For deep networks most parameter settings corresponding to a local minimum are fine.

[Figure: contours of J over (x1, x2), showing the initial guess, a local minimum and the global minimum.]
End of Technical interlude
Gradient descent solution
Why?
• This avoids the numerical problems that arise when X T X is
(nearly) singular.
• It also avoids the need for working with large matrices.
How
1. Begin with an initial guess w₁^(0) for w₁.
2. Update the weight vector by moving a small distance in the
   direction −∇_{w₁} L.

Gradient descent solution

At the solution the gradient vanishes:

X^T (X w₁ − y) = 0
J(D, θ) = (1/|D|) Σ_{(x,y)∈D} l(y, f(x; θ)) + λ R(θ)

If |D| is large
• ⇒ computing ∇_θ J(D, θ)|_{θ^(t)} is time consuming
• When |D^(t)| = 1:
  ∇_θ J(D^(t), θ)|_{θ^(t)} is a noisy estimate of ∇_θ J(D, θ)|_{θ^(t)}
• Therefore
  |D| noisy update steps in SGD ≈ 1 correct update step in GD.
• Still get lots of updates per epoch (one iteration through all the
  training data).
What learning rate?
• Strategies
  - Constant: η^(t) = .01
  - Decreasing: η^(t) = 1/√t
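A minimal SGD sketch on a toy least-squares problem may make this concrete; the dataset, the constant step size η = .01, the epoch count and batch size |D^(t)| = 1 are illustrative assumptions (the decreasing 1/√t schedule would simply replace the `eta` line):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy noiseless least-squares problem: minimize mean (w^T x - y)^2.
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(d)
eta = 0.01                                   # constant learning rate
for epoch in range(50):
    for i in rng.permutation(n):             # |D| noisy steps per epoch
        # Noisy gradient from a single example (|D^(t)| = 1)
        g = 2.0 * (X[i] @ w - y[i]) * X[i]
        w -= eta * g
print(w)  # approaches w_true
```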
J_ridge(D, w, b) = (1/2) ‖X w + b 1 − y‖² + λ ‖w‖²

where λ > 0 and small and X is the data matrix

X = [ ← x₁^T →
      ← x₂^T →
        ⋮
      ← xₙ^T → ]
Solving Ridge Regression: Centre the data to simplify

X_c = [ ← x_{c,1}^T →
        ← x_{c,2}^T →
          ⋮
        ← x_{c,n}^T → ]   where x_{c,i} = x_i − μ_x

⇒ X_c^T 1 = 0.

• Optimal bias with centred input X_c (does not depend on w*) is:

∂J_ridge/∂b = b 1^T 1 + w^T X_c^T 1 − 1^T y
            = b 1^T 1 − 1^T y

⇒ b* = (1/n) Σ_{i=1}^n y_i = ȳ.
Solving ridge regression: Optimal weight vector

∂J_ridge/∂w = (X_c^T X_c + λ I_d) w − X_c^T y

Setting this to zero gives w* = (X_c^T X_c + λ I_d)^{-1} X_c^T y.

[Figure: the resulting fits for λ = 1, 10, 100, 1000.]
l(x, y; w, b) = max(0, 1 − y(w^T x + b))

[Figure: the hinge loss max(0, 1 − y f(x; θ)) plotted against y f(x; θ) from −3 to 3.]
Subgradient of a function

g is a subgradient of f at x if f(y) ≥ f(x) + g (y − x) for all y
(⇐⇒ (g, −1) supports epi f at (x, f(x)))

• 1D example:
  [Figure: f(x) with supporting lines at x₁ and x₂.]
  - g₂, g₃ are subgradients at x₂
  - g₁ is a subgradient at x₁.

  (Prof. S. Boyd, EE392o, Stanford University)

• 1D example: f(x) = |x|. The subdifferential ∂f(x) is −1 for x < 0,
  +1 for x > 0, and the whole interval [−1, 1] at x = 0.
J_svm(D, w, b) = (λ/2) ‖w‖² + Σ_{(x,y)∈D} max(0, 1 − y(w^T x + b))
                                          \______________________/
                                                 Hinge Loss
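Since the hinge has a kink at y(w^T x + b) = 1, training uses subgradients. A minimal sketch of the per-example loss and one valid subgradient of the regularized objective (the convention of picking 0 at the kink, and the demo values, are illustrative choices):

```python
import numpy as np

def hinge_loss(w, b, x, y):
    # max(0, 1 - y (w^T x + b))
    return max(0.0, 1.0 - y * (w @ x + b))

def hinge_subgradient(w, b, x, y, lam):
    # A valid subgradient of (lam/2)||w||^2 + max(0, 1 - y(w^T x + b))
    # for a single example; at the kink we pick 0 for the hinge part.
    if y * (w @ x + b) < 1.0:
        return lam * w - y * x, -y
    return lam * w, 0.0

# Demo: a point outside the margin has zero loss, one inside does not.
w, b = np.array([1.0, 0.0]), 0.0
print(hinge_loss(w, b, np.array([2.0, 0.0]), 1))   # 0.0
print(hinge_loss(w, b, np.array([0.5, 0.0]), 1))   # 0.5
```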
[Figure (separable and non-separable cases): hyperplane x^T β + β₀ = 0 with margin M = 1/‖β‖ on each side; in the non-separable case some points violate the margin by slack amounts ξ_i*.]
Alternative formulation of SVM optimization
SVM solves this constrained optimization problem:

min_{w,b} (1/2) w^T w + C Σ_{i=1}^n ξ_i   subject to

y_i(w^T x_i + b) ≥ 1 − ξ_i for i = 1, . . . , n and
ξ_i ≥ 0 for i = 1, . . . , n.

Note that

y_i(w^T x_i + b) ≥ 1 − ξ_i ⇒ ξ_i ≥ 1 − y_i(w^T x_i + b)
Example dataset: CIFAR-10
CIFAR-10
• 10 classes: airplane, automobile, bird, cat, deer, dog, frog,
  horse, ship, truck
• 50,000 training images
• 10,000 test images
• Each image has size 32 × 32 × 3
Technical description of the multi-class problem
g: R^d × R^P → {1, . . . , C}

where R^d is the input space and R^P is the parameter space.

Usually

g(x; Θ) = arg max_j f_j(x; θ_j)

where for j = 1, . . . , C:

f_j: R^d × R^p → R

and Θ = (θ₁, θ₂, . . . , θ_C).
Multi-class linear classifier

Flatten the image (32 × 32 × 3) into a vector x (3072 × 1); the class scores are

s = W x + b

with W of size 10 × 3072, x of size 3072 × 1, and b and the class scores s of size 10 × 1.
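The shapes above can be sanity-checked in numpy; the random image and placeholder weights below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Shapes from the slide: a 32x32x3 CIFAR image flattened to 3072,
# W is 10x3072, b is length 10; weights are random placeholders.
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float64)
x = image.reshape(3072)          # flatten image
W = 0.01 * rng.normal(size=(10, 3072))
b = np.zeros(10)

s = W @ x + b                    # class scores, one per CIFAR-10 class
print(s.shape)  # (10,)
```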
Interpreting a multi-class linear classifier

[Figure: the per-class linear classifiers (e.g. the car, airplane and deer classifiers) visualized as hyperplanes/templates over the image space.]
How do we learn W and b?

As before need to
• Specify a loss function (+ a regularization term)
  - it must quantify the quality of all the class scores across all the
    training data.
Multi-class SVM Loss

s_j = f_j(x; w_j, b_j) = w_j^T x + b_j

l = Σ_{j=1, j≠y}^C max(0, s_j − s_y + 1)

[Figure: the per-class term max(0, s_j − s_y + 1) plotted against s_j − s_y.]
Calculate the multi-class SVM loss for a CIFAR image

input: x, scores s = W x + b, output label y = 8,
loss l = Σ_{j=1, j≠y}^{10} max(0, s_j − s_y + 1)

Scores
airplane   -0.3166
car        -0.6609
bird        0.7058
cat         0.8538
deer        0.6525
dog         0.1874
frog        0.6072
horse       0.5134
ship       -1.3490
truck      -1.2225
f(x; W, b) = W x + b = s

l(x, y, W, b) = Σ_{j=1, j≠y}^C max(0, s_j − s_y + 1)

L(D, W, b) = (1/|D|) Σ_{(x,y)∈D} l(x, y, W, b)

If the scores are rescaled by some α > 0,

f(x; W₁, b₁) = W₁ x + b₁ = s′ = α (W x + b),

then the loss changes:

l(x, y, W₁, b₁) = Σ_{j=1, j≠y}^C max(0, s′_j − s′_y + 1)
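The per-example loss l(x, y, W, b) is easy to implement directly from the scores; the three-class score vector below is made up so the sum can be checked by hand (0-indexed label y = 2):

```python
import numpy as np

def multiclass_svm_loss(s, y):
    # l = sum over j != y of max(0, s_j - s_y + 1)
    margins = np.maximum(0.0, s - s[y] + 1.0)
    margins[y] = 0.0   # skip the j = y term
    return margins.sum()

# Worked example: s_y = 2.5.
# j=0: max(0, 3.0 - 2.5 + 1) = 1.5 ; j=1: max(0, -1.0 - 2.5 + 1) = 0
s = np.array([3.0, -1.0, 2.5])
print(multiclass_svm_loss(s, 2))  # 1.5
```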
L(D, W, b) = (1/|D|) Σ_{(x,y)∈D} Σ_{j=1, j≠y}^C max(0, f_j(x; W, b) − f_y(x; W, b) + 1) + λ R(W)

Elastic Net: R(W) = Σ_k Σ_l (β W_{k,l}² + |W_{k,l}|)
Cross-entropy loss

Probabilistic interpretation of scores

s = W x + b

• Can interpret the scores, s, as unnormalized log probabilities for each class:

s_j = log p′_j

where p_j = α p′_j with α = 1 / Σ_k p′_k, so that the p_j sum to one.

⇒

P_{Y|X}(j | x) = p_j = exp(s_j) / Σ_k exp(s_k)
SoftMax operation

SoftMax(s)_j = exp(s_j) / Σ_k exp(s_k)

[Figure: two example score vectors s (entries in [−1, 2]) and the corresponding SoftMax(s); the outputs are non-negative and sum to one.]
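A minimal numpy implementation of the SoftMax operation; subtracting max(s) before exponentiating is a standard numerical-stability trick (it leaves the result unchanged because the factor exp(−max(s)) cancels in the ratio):

```python
import numpy as np

def softmax(s):
    # Shift by max(s): same result, but avoids overflow in exp
    # when scores are large.
    e = np.exp(s - np.max(s))
    return e / e.sum()

s = np.array([1.0, 2.0, 3.0])
p = softmax(s)
print(p.sum())  # 1.0
```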
SoftMax classifier: Log likelihood of the training data

θ* = arg max_θ (1/|D|) Σ_{(x,y)∈D} log P_{Y|X}(y | x; θ)

   = arg min_θ (1/|D|) Σ_{(x,y)∈D} − log P_{Y|X}(y | x; θ)

where s = W x + b.
SoftMax classifier + cross-entropy loss

p = SoftMax(W x + b)

• Cross-entropy loss for training example x with label y is

l = − log(p_y)

[Figure: − log(p_y) plotted against p_y.]
Calculate the cross-entropy loss for a CIFAR image

input: x, scores s = W x + b, output label y = 8,
loss l = − log( exp(s_y) / Σ_k exp(s_k) )

Scores
airplane   -0.3166
car        -0.6609
bird        0.7058
cat         0.8538
deer        0.6525
dog         0.1874
frog        0.6072
horse       0.5134
ship       -1.3490
truck      -1.2225

Steps: compute s = W x + b, exponentiate to get exp(s), then normalize: exp(s) / Σ_k exp(s_k).
Cross-entropy loss

l(x, y, W, b) = − log( exp(s_y) / Σ_{k=1}^C exp(s_k) )

Questions
• What is the minimum possible value of l(x, y, W, b)?
• What is the max possible value of l(x, y, W, b)?
• At initialization all the entries of W are small ⇒ all s_k ≈ 0.
  What is the loss?
• A training point's input value is changed slightly. What happens to
  the loss?
• The log of zero is not defined. Could this be a problem?
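A short implementation makes the initialization question easy to check numerically: when all scores are (approximately) zero, every class gets probability 1/C and the loss is log C, a useful sanity check at the start of training. The example values here are illustrative:

```python
import numpy as np

def cross_entropy_loss(s, y):
    # l = -log( exp(s_y) / sum_k exp(s_k) ), computed stably via log-sum-exp
    s = s - np.max(s)
    return -(s[y] - np.log(np.sum(np.exp(s))))

# At initialization all scores are ~0, so each class gets probability 1/C
# and the loss equals log C.
C = 10
s0 = np.zeros(C)
print(cross_entropy_loss(s0, 3))   # log(10), about 2.3026
```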
Learning the parameters: W, b

s = f(x; W, b) = W x + b

• We have a choice of loss functions

l_cross(x, y, W, b) = − log( exp(s_y) / Σ_{k=1}^C exp(s_k) )

l_svm(x, y, W, b) = Σ_{j=1, j≠y}^C max(0, s_j − s_y + 1)

where

J(D, W, b) = (1/|D|) Σ_{(x,y)∈D} l_cross(svm)(x, y, W, b) + λ R(W)
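Putting the pieces together, a minimal mini-batch SGD loop for the SoftMax classifier with cross-entropy loss and L2 regularization might look like the sketch below. The synthetic data, hyper-parameters (λ, η, batch size, epochs) and the closed-form gradient ∂l/∂s = p − one_hot(y) are all illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for a dataset: labels come from a random linear map,
# so a linear classifier can fit them well.
n, d, C = 500, 20, 10
X = rng.normal(size=(n, d))
y = np.argmax(X @ rng.normal(size=(d, C)), axis=1)

W = 0.01 * rng.normal(size=(C, d))
b = np.zeros(C)
lam, eta, batch = 0.001, 0.1, 50

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def J(W, b):
    # J(D, W, b) = (1/|D|) sum of cross-entropy losses + lam * R(W)
    P = softmax_rows(X @ W.T + b)
    return -np.log(P[np.arange(n), y]).mean() + lam * np.sum(W ** 2)

loss_before = J(W, b)
for epoch in range(10):
    for idx in np.split(rng.permutation(n), n // batch):
        Xb, yb = X[idx], y[idx]
        P = softmax_rows(Xb @ W.T + b)
        G = (P - np.eye(C)[yb]) / len(idx)   # dl/ds = p - one_hot(y)
        W -= eta * (G.T @ Xb + 2 * lam * W)  # gradient of J w.r.t. W
        b -= eta * G.sum(axis=0)
print(J(W, b) < loss_before)  # True: the training objective decreased
```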