
My first Solver

Optimization for ML
Extra Class
February 07, 2023 (Wednesday), 6PM, L20
In lieu of missed class on January 26, 2024
The “best” Linear Classifier
Final Form of C-SVM
Recall that the C-SVM optimization finds a model by solving

min_{w, b, ξ}  ½ ||w||² + C · Σ_i ξ_i
s.t.  y_i (w^T x_i + b) ≥ 1 − ξ_i  for all i
as well as  ξ_i ≥ 0  for all i

Using the previous discussion, we can rewrite the above very simply as an unconstrained problem:

min_w  ½ ||w||² + C · Σ_i max(0, 1 − y_i · w^T x_i)
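The unconstrained hinge-loss form can be evaluated directly. A minimal sketch, where the toy data, the helper name `csvm_objective`, and C = 1 are illustrative assumptions:

```python
import numpy as np

def csvm_objective(w, X, y, C=1.0):
    """Unconstrained C-SVM objective: 0.5*||w||^2 + C * sum of hinge losses."""
    margins = y * (X @ w)                    # y_i * w^T x_i for each point
    hinge = np.maximum(0.0, 1.0 - margins)   # slack xi_i = max(0, 1 - margin)
    return 0.5 * w @ w + C * hinge.sum()

# Toy 2-D data: two well-separated points
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
print(csvm_objective(w, X, y))  # both margins are 2, hinge = 0; value = 0.5
```

For this `w` every point has margin at least 1, so only the regularizer contributes to the objective.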
Use Calculus for Optimization
Method 1: First-order optimality condition
Exploits the fact that the gradient must vanish at a local optimum
Also exploits the fact that for convex functions, local minima are global
Warning: works only for simple convex functions when there are no constraints
To do: given a convex function f that we wish to minimize, try finding all the stationary points of the function (set the gradient to zero)
If you find only one, that has to be the global minimum
Example: f(x) = x² has f′(x) = 2x = 0 only at x = 0, and f″(x) = 2 > 0, i.e. f is cvx, i.e. x = 0 is the global min
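The same recipe can be checked mechanically on a convex quadratic, where the unique stationary point is found by solving a linear system. A sketch; the matrix A and vector b below are illustrative choices:

```python
import numpy as np

# f(w) = 0.5 * w^T A w - b^T w with A positive definite, so f is convex.
# First-order condition: grad f(w) = A w - b = 0, the only stationary point.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite (illustrative)
b = np.array([1.0, 1.0])

w_star = np.linalg.solve(A, b)           # solve A w = b
grad = A @ w_star - b                    # gradient at the stationary point

print(w_star)                            # the unique global minimum: [0.2, 0.4]
print(np.allclose(grad, 0.0))            # gradient vanishes: True
```

Because f is convex and this is the only stationary point, it must be the global minimum.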
Use Calculus for Optimization
Method 2: Perform (sub)gradient descent
Recall that the direction opposite to the gradient offers the steepest descent

(SUB)GRADIENT DESCENT
1. Given: obj. func. f to minimize
2. Initialize w^0
3. For t = 1, 2, …
   1. Obtain a (sub)gradient g^t
   2. Choose a step length η_t
   3. Update w^{t+1} ← w^t − η_t · g^t
4. Repeat until convergence

Questions to keep in mind: How to initialize w^0? How to choose η_t (often called “step length” or “learning rate”)? What is convergence, and how to decide if we have converged?
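The steps above can be sketched in a few lines. The stopping rule (a tiny step) and the quadratic test function are illustrative choices, not the only possibilities:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, tol=1e-8, max_iters=10_000):
    """Generic (sub)gradient descent: w <- w - eta * g until the step is tiny."""
    w = np.array(w0, dtype=float)
    for _ in range(max_iters):
        g = grad(w)
        w_new = w - eta * g                   # move opposite to the gradient
        if np.linalg.norm(w_new - w) < tol:   # one simple convergence test
            return w_new
        w = w_new
    return w

# Minimize f(w) = ||w - c||^2, whose gradient is 2 (w - c); minimum at w = c.
c = np.array([1.0, -2.0])
w_hat = gradient_descent(lambda w: 2.0 * (w - c), w0=np.zeros(2))
print(w_hat)  # close to [1, -2]
```

Other convergence tests (small gradient norm, small change in objective value) are equally common; this one is just the simplest to state.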
Gradient Descent (GD)
Move opposite to the gradients
Choose the step length carefully, else we may overshoot the global minimum even with a great initialization
Also, initialization may affect the result: with one initialization we converged to a local minimum; another time the initialization was really nice and we reached the global minimum
[Figure: GD trajectories on a non-convex function, one run ending in a local minimum and one reaching the global minimum]
With convex fns, all local minima are global minima, so we can afford to be less careful with initialization
Still need to be careful with step lengths, otherwise we may overshoot even global minima
Behind the scenes in GD for SVM
f(w) = ½ ||w||² + C · Σ_i ℓ_i(w), where ℓ_i(w) = max(0, 1 − y_i · w^T x_i) (ignore bias for now)
A (sub)gradient is g = w + C · Σ_i g_i, where g_i = −y_i x_i if y_i · w^T x_i < 1 and g_i = 0 otherwise
Assume C = 1 for a moment for sake of understanding
If w^t does well on x_i, say y_i · (w^t)^T x_i ≥ 1, then g_i = 0: no change to w^{t+1} due to the data point x_i
If w^t does badly on x_i, say y_i · (w^t)^T x_i < 1, then g_i = −y_i x_i, so the update adds η_t · y_i x_i and w^{t+1} may get a much better margin on x_i than w^t
Small η_t: w^{t+1} stays close to w^t, i.e. do not change w too much! Large η_t: feel free to change w as much as the gradient dictates
So gradient descent, although a mathematical tool from calculus, actually tries very actively to make the model perform better on all data points
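The per-point behaviour described above can be made concrete. A sketch of the hinge subgradient; the toy data and the helper name `svm_subgradient` are illustrative:

```python
import numpy as np

def svm_subgradient(w, X, y, C=1.0):
    """Subgradient of 0.5*||w||^2 + C * sum_i max(0, 1 - y_i w^T x_i)."""
    margins = y * (X @ w)
    active = margins < 1.0                 # points the model does badly on
    # Well-classified points (margin >= 1) contribute nothing; each active
    # point contributes -y_i x_i, pulling w towards a better margin on it.
    g_loss = -(y[active, None] * X[active]).sum(axis=0)
    return w + C * g_loss

X = np.array([[2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, 1.0])
w = np.array([1.0, 0.0])
# First point has margin 2 (no contribution); second has margin 0.5 < 1,
# so it contributes -x_2 = [-0.5, 0] and the update will improve its margin.
print(svm_subgradient(w, X, y))  # [0.5, 0.0]
```

Note that once every point has margin at least 1, the subgradient reduces to `w` alone, i.e. only the regularizer keeps shrinking the model.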
Stochastic Gradient Method
f(w) = ½ ||w||² + C · Σ_{i=1}^n ℓ_i(w), where ℓ_i(w) = max(0, 1 − y_i · w^T x_i)
Calculating each gradient takes O(n·d) time since there are n loss terms in d dimensions – too expensive in total
At each time t, choose a random data point i_t and use g^t = w^t + C · n · ∇ℓ_{i_t}(w^t) – only O(d) time!!
Warning: may have to perform several more SGD steps than we had to do with GD, but each SGD step is much cheaper than a GD step
We take a random data point to avoid being unlucky (also it is cheap)
Do we really need to spend so much time on just one update? No, SGD gives a cheaper way to perform a gradient step: initially, all we need is a general direction in which to move, especially in the beginning, when we are far away from the optimum!
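One SGD step, plus a running average of iterates to smooth out the jitter, might look as follows. The toy data, the fixed seed, and the 1/t step-length schedule are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(w, X, y, eta, C=1.0):
    """One SGD step for the C-SVM objective using a single random point."""
    n = len(y)
    i = rng.integers(n)                        # random data point: O(d) work
    margin = y[i] * (X[i] @ w)
    g_loss = -y[i] * X[i] if margin < 1.0 else np.zeros_like(w)
    g = w + C * n * g_loss                     # n * per-point term keeps the
    return w - eta * g                         # estimate unbiased for the sum

# Toy separable data; average the iterates to tame the variance
X = np.array([[2.0, 0.0], [0.0, 1.5], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
w, w_avg = np.zeros(2), np.zeros(2)
for t in range(1, 501):
    w = sgd_step(w, X, y, eta=1.0 / t)         # decaying step length
    w_avg += (w - w_avg) / t                   # running average of iterates
print((y * (X @ w_avg) > 0).all())             # True: all signs correct
```

Averaging the iterates is one standard way to get a stable answer out of a noisy SGD trajectory; the last iterate alone can still be jittery.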
Mini-batch SGD
If data is very diverse, the “stochastic” gradient may vary quite a lot depending on which random data point is chosen
This is called variance (more on this later), and it can slow down the SGD process – make it jittery
One solution: choose more than one random point
At each step, choose B random data points (B = mini-batch size) without replacement, say i_1, …, i_B, and use g^t = w^t + (C · n / B) · Σ_{b=1}^B ∇ℓ_{i_b}(w^t)
Takes O(B·d) time to execute a step of MBSGD – more expensive than SGD
Notice that if B = n then MBSGD becomes plain GD
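A mini-batch stochastic gradient for the hinge-loss objective can be sketched as below; the helper name, toy data, and seed are illustrative. With B = n the mini-batch covers the whole dataset, so the estimate coincides with the plain GD subgradient:

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_gradient(w, X, y, B, C=1.0):
    """Hinge subgradient estimated from B points drawn without replacement."""
    n = len(y)
    idx = rng.choice(n, size=B, replace=False)    # mini-batch of size B
    Xb, yb = X[idx], y[idx]
    active = yb * (Xb @ w) < 1.0                  # points with margin < 1
    g_loss = -(yb[active, None] * Xb[active]).sum(axis=0)
    return w + C * (n / B) * g_loss               # n/B scaling keeps it unbiased

X = np.array([[2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, 1.0])
w = np.array([1.0, 0.0])
# With B = n this is the full-batch subgradient: only the second point
# (margin 0.5 < 1) contributes, giving w + (-0.5, 0) = (0.5, 0).
print(minibatch_gradient(w, X, y, B=2))
```

Intermediate values 1 < B < n trade the O(B·d) cost of a step against lower variance of the gradient estimate.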
Coordinate Descent
Sometimes we are able to optimize completely along a given variable (even if constraints are there) – called coordinate minimization (CM)
Similar to GD except only one coordinate is changed in a single step
E.g. minimize f(w), with ∇_j f as the j-th partial derivative

COORDINATE DESCENT
1. For t = 1, 2, …
   1. Select a coordinate j_t
   2. Let w^{t+1}_{j_t} = w^t_{j_t} − η_t · ∇_{j_t} f(w^t)
   3. Let w^{t+1}_k = w^t_k for k ≠ j_t
   4. Repeat until convergence

CCD: choose coordinates cyclically
SCD: choose j_t randomly
Block CD: choose a small set of coordinates at each step to update
Randperm: permute coordinates randomly and choose them in that order. Once the list is over, choose a new random permutation and start over (very effective)
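The loop above, with cyclic and random-permutation schemes, can be sketched as follows; the function names and the separable quadratic test problem are illustrative:

```python
import numpy as np

def coordinate_descent(grad_j, w0, eta=0.2, sweeps=200, scheme="randperm"):
    """Update one coordinate at a time; grad_j(w, j) is the j-th partial."""
    rng = np.random.default_rng(0)
    w = np.array(w0, dtype=float)        # copy so the caller's w0 is untouched
    d = len(w)
    for _ in range(sweeps):
        if scheme == "randperm":
            order = rng.permutation(d)   # fresh random order each sweep
        else:
            order = np.arange(d)         # cyclic scheme (CCD)
        for j in order:
            w[j] -= eta * grad_j(w, j)   # only coordinate j changes
    return w

# Minimize f(w) = sum_j (w_j - c_j)^2; the j-th partial is 2 (w_j - c_j).
c = np.array([3.0, -1.0, 0.5])
w_hat = coordinate_descent(lambda w, j: 2.0 * (w[j] - c[j]), np.zeros(3))
print(w_hat)  # close to [3, -1, 0.5]
```

For this separable objective each coordinate update is independent, so every scheme converges to the same answer; the schemes differ mainly on coupled objectives.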
Constrained Optimization
Method 3: Creating a Dual Problem
Suppose we wish to solve min_x f(x) s.t. g(x) ≤ 0 (let us see later how to handle multiple constraints and equality constraints)
Trick: sneak this constraint into the objective
Construct a barrier (indicator) fn h so that h(x) = 0 if g(x) ≤ 0 and h(x) = ∞ otherwise, and simply solve min_x f(x) + h(x)
Easy to see that both problems have the same solution
One very elegant way to construct such a barrier is the following: h(x) = max_{α ≥ 0} α · g(x)
Hmm … we still have a constraint here (α ≥ 0), but a very simple one
Thus, we want to solve min_x { f(x) + max_{α ≥ 0} α · g(x) }
Same as min_x max_{α ≥ 0} { f(x) + α · g(x) }
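The barrier behaviour can be illustrated numerically by maximizing α · g(x) over a finite grid of α values; the grid is only a stand-in for the unbounded set α ≥ 0:

```python
import numpy as np

def barrier(gx, alphas):
    """Approximate max over alpha >= 0 of alpha * g(x) on a finite grid."""
    return max(a * gx for a in alphas)

alphas = np.linspace(0.0, 100.0, 101)       # finite stand-in for alpha >= 0
print(barrier(-2.0, alphas))  # g(x) <= 0: best alpha is 0, barrier = 0
print(barrier(0.5, alphas))   # g(x) > 0: grows with alpha, 50 on this grid
```

In the true barrier the second case blows up to infinity as α → ∞, which is exactly what forbids infeasible points in the min-max formulation.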
