Machine Learning Course - CS-433

Optimization

Sep 19+24, 2019

minor changes by Martin Jaggi 2019, 2018, 2017; © Martin Jaggi and Mohammad Emtiyaz Khan 2016
Last updated on: September 19, 2019

Learning / Estimation / Fitting
Given a cost function L(w), we wish to find w* which minimizes the cost:

    \min_{w} L(w) \qquad \text{subject to } w \in \mathbb{R}^D

This means the learning problem is formulated as an optimization problem.

Examples:
• linear model: f_w(x) = w^⊤ x
• neural network

We will use an optimization algorithm to solve the problem (to find a good w).

Grid Search
Grid search is one of the simplest optimization algorithms. We compute the cost over all values w in a grid, and pick the best among those.

This is brute force, but extremely simple, and it works for any kind of cost function when we have very few parameters and the cost is easy to compute.

[Figure: a grid of candidate values w ∈ R^D for D = 2, with the cost L(w) evaluated at each grid point.]

① For a large number of parameters D, however, grid search has too many "for-loops", resulting in an exponential computational complexity: if we decide to use 10 possible values for each dimension of w, then we have to check 10^D points. This is clearly impossible for most practical machine learning models, which can often have D ≈ millions of parameters. Choosing a good range of values for each dimension is another problem.

② Other issues: no guarantee can be given that we end up close to an optimum.
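
To make the exponential blow-up concrete, here is a minimal grid-search sketch (not from the lecture; the cost function, grid range, and resolution are illustrative assumptions):

```python
import itertools
import numpy as np

def grid_search(L, D, num_values=10, low=-5.0, high=5.0):
    """Brute-force grid search: evaluate L at num_values**D grid points."""
    grid_1d = np.linspace(low, high, num_values)
    best_w, best_cost = None, np.inf
    # One implicit "for-loop" per dimension: num_values**D candidates in total.
    for w in itertools.product(grid_1d, repeat=D):
        cost = L(np.array(w))
        if cost < best_cost:
            best_w, best_cost = np.array(w), cost
    return best_w, best_cost

# Toy quadratic cost with minimum at (1, 2); feasible only because D = 2.
L_toy = lambda w: np.sum((w - np.array([1.0, 2.0])) ** 2)
w_best, cost_best = grid_search(L_toy, D=2)
```

Already at D = 20 this loop would need 10^20 evaluations, which is exactly the blow-up described above.
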
Optimization Landscapes

[Figure: a one-dimensional optimization landscape over w ∈ R^D, marking local and global minima w*; taken from Bertsekas, Nonlinear Programming.]

A vector w* is a local minimum of L if it is no worse than its neighbors, i.e., there exists an ε > 0 such that

    L(w^\star) \le L(w) \qquad \forall w \text{ with } \|w - w^\star\| < \varepsilon

(the ε-neighborhood of w*).

A vector w* is a global minimum of L if it is no worse than all others:

    L(w^\star) \le L(w) \qquad \forall w \in \mathbb{R}^D

A local or global minimum is said to be strict if the corresponding inequality is strict for w ≠ w*.

Smooth Optimization

Follow the Gradient
A gradient (at a point) is the slope of the tangent to the function (at that point). It points in the direction of largest increase of the function.

For a 2-parameter model, MSE(w) and MAE(w) are shown below. (We used y_n ≈ w_0 + w_1 x_{n1} with y^⊤ = [2, 1, 1.5] and x^⊤ = [−1, 1, 1].)

[Figure: surface plots of MSE(w) and MAE(w) over (w_0, w_1) for this data.]

Definition of the gradient:

    \nabla L(w) := \left( \frac{\partial L(w)}{\partial w_1}, \ldots, \frac{\partial L(w)}{\partial w_D} \right)^{\!\top}

This is a vector, ∇L(w) ∈ R^D.

Gradient Descent
To minimize the function, we iteratively take a step in the (opposite) direction of the gradient, for t = 1, 2, ..., T:

    w^{(t+1)} := w^{(t)} - \gamma \, \nabla L(w^{(t)})

where γ > 0 is the step-size (or learning rate). Then repeat with the next t.

Example: gradient descent for the 1-parameter model f_w(x) = w_0, to minimize the MSE L(w_0) = (1/2N) Σ_n (y_n − w_0)². The update w_0^{(t+1)} := w_0^{(t)} − γ ∇L(w_0^{(t)}) becomes

    w_0^{(t+1)} := (1 - \gamma) \, w_0^{(t)} + \gamma \, \bar{y}

where ȳ := Σ_n y_n / N. When is this sequence guaranteed to converge? (Subtracting ȳ on both sides of one step gives w_0^{(t+1)} − ȳ = (1 − γ)(w_0^{(t)} − ȳ), so the iterates converge to ȳ whenever |1 − γ| < 1, i.e. for 0 < γ < 2.)
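
As a sketch of this update in code (the data are the toy values from the earlier MSE/MAE example; the step-size γ = 0.5 is an illustrative choice):

```python
import numpy as np

def gradient_descent_1d(y, gamma=0.5, max_iters=50):
    """GD for L(w0) = 1/(2N) * sum((y_n - w0)^2); the minimizer is mean(y)."""
    w0 = 0.0
    for t in range(max_iters):
        grad = w0 - np.mean(y)      # dL/dw0 = -(1/N) * sum(y_n - w0)
        w0 = w0 - gamma * grad      # equivalently (1 - gamma)*w0 + gamma*mean(y)
    return w0

y = np.array([2.0, 1.0, 1.5])
print(gradient_descent_1d(y))       # ~1.5 = mean(y), since 0 < gamma < 2
```
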
Gradient Descent for Linear MSE
For linear regression with parameters w = (w_1, ..., w_D), the data is

    y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad
    X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1D} \\ x_{21} & x_{22} & \cdots & x_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{ND} \end{bmatrix}

where row n of X is datapoint x_n^⊤. We define the error vector e:

    e = y - Xw

and the MSE as follows:

    L(w) := \frac{1}{2N} \sum_{n=1}^{N} \left( y_n - x_n^\top w \right)^2 = \frac{1}{2N} \, e^\top e

Then the gradient is given by

    \nabla L(w) = -\frac{1}{N} X^\top e

Computational cost. What is the complexity (# operations) of computing the gradient?

a) Starting from w: ① compute e = y − Xw, at cost O(N·D) for Xw plus O(N) for the subtraction.

b) Given e: ② compute ∇L(w) = −(1/N) X^⊤ e, at cost O(N·D) plus O(D).

Total: O(N·D) (in order notation).
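
A direct NumPy transcription of the two steps above (a sketch; the variable names are mine):

```python
import numpy as np

def mse_gradient(y, X, w):
    """Gradient of L(w) = e^T e / (2N), with e = y - Xw."""
    N = len(y)
    e = y - X @ w            # step 1: error vector, O(N*D)
    return -(X.T @ e) / N    # step 2: gradient, O(N*D); total O(N*D)
```
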
Variant with offset. Recall the alternative trick when also incorporating an offset term for the regression: add an artificial constant feature 1 to each data point,

    y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad
    \tilde{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1D} \\ 1 & x_{21} & x_{22} & \cdots & x_{2D} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{ND} \end{bmatrix}

with weights w = (w_0, w_1, ..., w_D). Exercise: compute ∇L(w) for this variant.

Stochastic Gradient Descent

Sum objectives. In machine learning, most cost functions are formulated as a sum over the training examples, that is

    L(w) = \frac{1}{N} \sum_{n=1}^{N} L_n(w)

where L_n is the cost contributed by the n-th training example (the cost of data point n).

Q: What are the L_n for linear MSE? (Here L_n(w) = \frac{1}{2} (y_n - x_n^\top w)^2.)

The SGD Algorithm. The stochastic gradient descent (SGD) algorithm is given by the following update rule, at step t: pick one training point n ∈ {1, ..., N} uniformly at random, then

    w^{(t+1)} := w^{(t)} - \gamma \, \nabla L_n(w^{(t)})

Theoretical motivation. Idea: a cheap but unbiased estimate of the gradient! In expectation over the random choice of n, we have

    \mathbb{E}\left[ \nabla L_n(w) \right] = \frac{1}{N} \sum_{n=1}^{N} \nabla L_n(w) = \nabla L(w)

which is the true gradient direction. (Check!)
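
The "(check!)" can also be done numerically; a small sketch under assumed random data, comparing the full linear-MSE gradient with the average of the per-example gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X, y = rng.normal(size=(N, D)), rng.normal(size=N)
w = rng.normal(size=D)

full_grad = -(X.T @ (y - X @ w)) / N
# E[grad L_n(w)] = (1/N) * sum_n grad L_n(w), since n is uniform over {1,...,N}
per_example = np.array([-(y[n] - X[n] @ w) * X[n] for n in range(N)])
assert np.allclose(full_grad, per_example.mean(axis=0))
```
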
Mini-batch SGD. There is an intermediate version, using the update direction

    g := \frac{1}{|B|} \sum_{n \in B} \nabla L_n(w^{(t)}), \qquad w^{(t+1)} := w^{(t)} - \gamma \, g

In the above gradient computation, we have randomly chosen a subset B ⊆ [N] of the training examples. For each of these selected examples n, we compute the respective gradient ∇L_n at the same current point w^{(t)}. Examples:

• B = {n}: again SGD as above
• B = {1, ..., N}: (full) gradient descent
• |B| = 5: a mini-batch of size 5

The computation of g can be parallelized easily. This is how current deep-learning applications utilize GPUs (by running over |B| threads in parallel).

Note that in the extreme case B := [N], we obtain (batch) gradient descent, i.e. g = ∇L.
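
A minimal mini-batch SGD loop for linear MSE, covering all three cases above through the batch size (step-size, iteration count, and initialization are illustrative assumptions):

```python
import numpy as np

def minibatch_sgd(y, X, batch_size=5, gamma=0.1, max_iters=1000, seed=0):
    """Mini-batch SGD for linear MSE; batch_size=1 gives SGD, batch_size=N gives GD."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for t in range(max_iters):
        B = rng.choice(N, size=batch_size, replace=False)  # random subset B of [N]
        e_B = y[B] - X[B] @ w
        w = w - gamma * (-(X[B].T @ e_B) / batch_size)     # step along -g
    return w
```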

SGD for Linear MSE

See Exercise Sheet 2.

Computational cost. For linear MSE, what is the complexity (# operations) of computing the stochastic gradient, starting from w (using only |B| = 1 data examples)?

The full gradient ∇L(w) = −(1/N) X^⊤ (y − Xw) costs O(N·D). In contrast, the stochastic gradient of L_n(w) = ½ (y_n − x_n^⊤ w)² is

    \nabla L_n(w) = -\left( y_n - x_n^\top w \right) x_n = -e_n \, x_n

with the scalar error e_n ∈ R, and costs only O(D).

Non-Smooth Optimization
An alternative characterization of convexity, for differentiable functions, is given by

    L(u) \ge L(w) + \nabla L(w)^\top (u - w) \qquad \forall u, w

meaning that the function must always lie above its linearization.

Subgradients
A vector g ∈ R^D such that

    L(u) \ge L(w) + g^\top (u - w) \qquad \forall u

is called a subgradient to the function L at w.

This definition makes sense for objectives L which are not necessarily differentiable (and not even necessarily convex).

If L is convex and differentiable at w, then the only subgradient at w is g = ∇L(w).

Subgradient Descent
Identical to the gradient descent algorithm, but using a subgradient instead of the gradient (useful e.g. for the non-smooth hinge loss). Update rule:

    w^{(t+1)} := w^{(t)} - \gamma \, g

for g being a subgradient to L at the current iterate w^{(t)}.

Example: Optimizing Linear MAE

1. Compute a subgradient of the absolute value function h : R → R, h(e) := |e|. (Toy example: the set of subgradients is {−1} for e < 0, {+1} for e > 0, and the whole interval [−1, +1] at the kink e = 0.)

2. Recall the definition of the mean absolute error:

    L(w) = \text{MAE}(w) := \frac{1}{N} \sum_{n=1}^{N} \left| y_n - f_w(x_n) \right|

For linear regression, its (sub)gradient is easy to compute using the chain rule: writing L(w) = h(q(w)), every g' in the set of subgradients ∂h(q(w)) gives a subgradient g' ∇q(w) of L at w. Compute it!

See Exercise Sheet 2.
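
A sketch of one valid subgradient choice for linear MAE; at the kinks e_n = 0 we pick 0 ∈ [−1, +1], which np.sign conveniently returns. (This anticipates, but does not reproduce, the Exercise Sheet 2 solution.)

```python
import numpy as np

def mae_subgradient(y, X, w):
    """One subgradient of MAE(w) = (1/N) * sum_n |y_n - x_n^T w|."""
    N = len(y)
    e = y - X @ w
    # d|e|/de = sign(e) for e != 0; at e = 0 any value in [-1, 1] works (we use 0).
    return -(X.T @ np.sign(e)) / N   # chain rule: de/dw = -X
```
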
Stochastic Subgradient Descent
Stochastic subgradient descent (still commonly abbreviated SGD): the same update, with g now being a subgradient to the randomly selected (possibly non-differentiable) L_n at the current iterate w^{(t)}.

Exercise: compute the SGD update for linear MAE.

Computational cost for linear models: computing a stochastic (sub)gradient of one L_n costs O(D) per step, for MSE and MAE alike, compared to O(N·D) for the full (sub)gradient of L.
In
Constrained Optimization
Sometimes, optimization problems come posed with additional constraints:

    \min_{w} L(w) \quad \text{subject to} \quad w \in \mathcal{C}

The set C ⊂ R^D is called the constraint set.

[Figure: two sketches of a cost L(w) together with a constraint set C ⊂ R^D.]

Solving Constrained Optimization Problems
A) Projected gradient descent
B) Transform it into an unconstrained problem

Convex Sets
A set C is convex iff the line segment between any two points of C lies in C, i.e., if for any u, v ∈ C and any θ with 0 ≤ θ ≤ 1, we have

    \theta u + (1 - \theta) v \in \mathcal{C}

[Figure 2.2 from S. Boyd & L. Vandenberghe, Convex Optimization: some simple convex and nonconvex sets. Left: the hexagon, which includes its boundary, is convex. Middle: the kidney-shaped set is not convex, since the line segment between the two points shown as dots is not contained in the set. Right: the square contains some of its boundary points but not others, and is not convex.]

Properties of Convex Sets
• Intersections of convex sets are convex.
• Projections onto convex sets are unique (and often efficient to compute). Formal definition:

    P_\mathcal{C}(w') := \arg\min_{v \in \mathcal{C}} \|v - w'\|

Projected Gradient Descent
Idea: add a projection onto C after every step:

    P_\mathcal{C}(w') := \arg\min_{v \in \mathcal{C}} \|v - w'\|

Update rule:

    w^{(t+1)} := P_\mathcal{C}\left[ w^{(t)} - \gamma \, \nabla L(w^{(t)}) \right]

[Figure: a gradient step leaving the constraint set C ⊂ R^D, followed by the projection P_C back onto C.]

Projected SGD. Same SGD step, followed by the projection step, as above. Same convergence properties.

• Computational cost of the projection? Crucial!
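
A sketch of projected gradient descent for one concrete constraint set, the L2 ball C = {w : ‖w‖₂ ≤ r} (chosen here because its projection costs only O(D); the radius and step-size are illustrative assumptions):

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """P_C(w) for C = {v : ||v||_2 <= radius}: rescale w if it lies outside C."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def projected_gd(grad_L, w0, gamma=0.1, radius=1.0, max_iters=100):
    w = w0
    for t in range(max_iters):
        w = project_l2_ball(w - gamma * grad_L(w), radius)  # step, then project
    return w
```
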
② Turning Constrained into Unconstrained Problems
(Alternatives to projected gradient methods.)

Use penalty functions instead of directly solving min_{w ∈ C} L(w):

• "Brick wall" (indicator function):

    I_\mathcal{C}(w) := \begin{cases} 0 & w \in \mathcal{C} \\ +\infty & w \notin \mathcal{C} \end{cases}
    \quad \Rightarrow \quad \min_{w \in \mathbb{R}^D} \; L(w) + I_\mathcal{C}(w)

  Here L(w) is the original cost and I_C(w) the additional penalty. (Disadvantage: non-continuous objective.)

• Penalize the error. Example: for C = {w ∈ R^D | Aw = b},

    \Rightarrow \quad \min_{w \in \mathbb{R}^D} \; L(w) + \frac{\lambda}{2} \, \|Aw - b\|_2^2

  where λ > 0 controls the trade-off between the original cost and the constraint violation.

• Linearized penalty functions (see Lagrange multipliers).
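
A sketch of the quadratic-penalty idea for C = {w : Aw = b}: we can run plain gradient descent on the penalized objective, whose gradient just adds the penalty term (λ and the example data are illustrative assumptions):

```python
import numpy as np

def penalized_gradient(grad_L, A, b, lam):
    """Gradient of the penalized objective L(w) + (lam/2) * ||Aw - b||_2^2."""
    return lambda w: grad_L(w) + lam * A.T @ (A @ w - b)

# Example: L(w) = ||w||^2 / 2 penalized toward the constraint w = (1, 1).
grad = penalized_gradient(lambda w: w, A=np.eye(2), b=np.ones(2), lam=10.0)
```

Larger λ enforces the constraint more strictly, at the price of worse conditioning.
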
Implementation Issues
For gradient methods:

Stopping criterion: when ‖∇L(w)‖ is (close to) zero, we are (often) close to an optimum.

Optimality: if the second-order derivative is positive (positive semi-definite, to be precise), then it is a (possibly local) minimum. If the function is also convex, then this condition implies that we are at a global optimum. See the supplementary section on Optimality Conditions.

Step-size selection: if γ is too big, the method might diverge. If it is too small, convergence is slow. Convergence to a local minimum is guaranteed only when γ < γ_min, where γ_min is a fixed constant that depends on the problem.

Line-search methods: for some objectives L, we can set the step-size automatically using a line-search method. More details on "backtracking" methods can be found in Chapter 1 of Bertsekas' book Nonlinear Programming.

Feature normalization and preconditioning: gradient descent is very sensitive to ill-conditioning. Therefore, it is typically advised to normalize your input features. In other words, we precondition the optimization problem. Without this, step-size selection is more difficult, since different "directions" might converge at different speeds.
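
A common way to precondition is to standardize each feature (a minimal sketch; note the statistics must be computed on the training split only):

```python
import numpy as np

def standardize(X_train, X_test):
    """Rescale every feature to zero mean and unit variance."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0   # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```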

Non-Convex Optimization

[Figure: a non-convex loss surface with multiple local minima; image from mathworks.com.]

Real-world problems are not convex! Still, all we have learnt about algorithm design and the performance of convex algorithms helps us in the non-convex world.

Additional Notes
Grid Search and Hyper-Parameter Optimization
Read more about grid search and other methods for "hyperparameter" setting:
en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search

Computational Complexity
The computational cost is expressed using big-O notation. Here is a definition taken from Wikipedia: let f and g be two functions defined on some subset of the real numbers. We write f(x) = O(g(x)) as x → ∞ if and only if there exist a positive real number c and a real number x_0 such that |f(x)| ≤ c |g(x)| for all x > x_0.

Please read and learn more from this page on Wikipedia:

en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra

• What is the computational complexity of matrix multiplication?


• What is the computational complexity of matrix-vector multiplication?

Optimality Conditions
For a convex optimization problem, the first-order necessary condition says that at an optimum the gradient is equal to zero:

    \nabla L(w^\star) = 0 \tag{1}

The second-order sufficient condition ensures that the optimum is a minimum (not a maximum or saddle point), using the Hessian matrix, which is the matrix of second derivatives:

    H(w^\star) := \frac{\partial^2 L(w^\star)}{\partial w \, \partial w^\top} \ \text{is positive semi-definite.} \tag{2}

The Hessian is also related to the convexity of a function: a twice-differentiable function is convex if and only if the Hessian is positive semi-definite at all points.
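
For small D, condition (2) can be checked numerically via the eigenvalues of the Hessian (a sketch; for linear MSE the Hessian is X^⊤X/N, which is always positive semi-definite):

```python
import numpy as np

def is_psd(H, tol=1e-10):
    """Check positive semi-definiteness via the smallest eigenvalue."""
    return np.min(np.linalg.eigvalsh(H)) >= -tol

# Hessian of the linear-MSE cost L(w) = ||y - Xw||^2 / (2N):
hessian = lambda X: (X.T @ X) / len(X)
```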

SGD Theory
As we have seen above, when N is large, choosing a random training example (x_n, y_n) and taking an SGD step is advantageous:

    w^{(t+1)} := w^{(t)} - \gamma^{(t)} \, \nabla L_n(w^{(t)})

For convergence, γ^(t) → 0 "appropriately". One such condition, called the Robbins-Monro condition, suggests taking γ^(t) such that:

    \sum_{t=1}^{\infty} \gamma^{(t)} = \infty, \qquad \sum_{t=1}^{\infty} \left( \gamma^{(t)} \right)^2 < \infty \tag{3}

One way to obtain such sequences is γ^(t) := 1/(t + 1)^r, where r ∈ (0.5, 1).
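
A sketch of SGD with such a decreasing schedule, using γ^(t) = 1/(t + 1)^r with the illustrative choice r = 0.75 ∈ (0.5, 1), for linear MSE:

```python
import numpy as np

def sgd_decaying(y, X, r=0.75, max_iters=10000, seed=0):
    """SGD for linear MSE with Robbins-Monro step-sizes gamma_t = 1/(t+1)^r."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for t in range(max_iters):
        gamma_t = 1.0 / (t + 1) ** r   # sum gamma_t = inf, sum gamma_t^2 < inf
        n = rng.integers(N)            # one uniformly random training example
        w = w - gamma_t * (-(y[n] - X[n] @ w) * X[n])
    return w
```
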
More Optimization Theory
If you want, you can gain a deeper understanding of several optimization methods relevant for machine learning from this survey:

Convex Optimization: Algorithms and Complexity, by Sébastien Bubeck

and also from the book of Boyd & Vandenberghe, Convex Optimization (both are free online PDFs; the latter has over 35 000 citations).

Exercises

1. Chain-rule

If it has been a while, familiarize yourself with it again.


2. Revise computational complexity (also see the Wikipedia link on page 6 of the lecture notes).
3. Derive the computational complexity of grid search, gradient descent, and stochastic gradient descent for linear MSE (# steps and cost per step).
4. Derive the gradients for the linear MSE and MAE cost functions.
5. Implement gradient descent and gain experience in setting the
step-size.
6. Implement SGD and gain experience in setting the step-size.
