Introduction to Non-Linear Optimization
Ross A. Lippert
D. E. Shaw Research
Optimization problems
Problem: let f : R^n → (−∞, ∞],
find
    min_{x ∈ R^n} f(x)
Quite general, but some cases, like f convex, are fairly solvable.
Today's problem: how about f : R^n → R, smooth?
find
    x* s.t. ∇f(x*) = 0
nD optimization
1D optimization
Finding ∇f(x) = 0:
    ∇f(x) = b − Ax = 0
    x = A^{-1} b
For a quadratic,
    f(x) = c − x^t b + (1/2) x^t A x
Linear solve: x = A^{-1} b.
Even for non-linear problems: if the optimal x* is near our x,
    f(x*) ≈ f(x) + (x* − x)^t ∇f(x) + (1/2) (x* − x)^t ∇∇f(x) (x* − x) + ···
    x* ≈ x − (∇∇f(x))^{-1} ∇f(x)
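A minimal sketch of iterating this Newton step in Python (not from the slides; the Rosenbrock test function, starting point, and tolerance are illustrative choices):

    import numpy as np

    # Illustrative smooth objective: the Rosenbrock function.
    def f(x):
        return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

    def grad(x):
        return np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                         200 * (x[1] - x[0]**2)])

    def hess(x):
        return np.array([[2 - 400 * (x[1] - 3 * x[0]**2), -400 * x[0]],
                         [-400 * x[0], 200.0]])

    x = np.array([-1.2, 1.0])
    for _ in range(50):
        # Newton step: solve (grad grad f) dx = -grad f; don't form the inverse.
        dx = np.linalg.solve(hess(x), -grad(x))
        x = x + dx
        if np.linalg.norm(grad(x)) < 1e-12:
            break
    print(x)  # should converge to the minimizer near [1, 1]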
Linear solve
x = A^{-1} b
But really we just want to solve
    Ax = b
Don't form A^{-1} if you can avoid it.
(Don't form A if you can avoid that!)
For a general A, there are three important special cases,

diagonal:
    A = [ a_1  0    0
          0    a_2  0
          0    0    a_3 ],  thus x_i = (1/a_i) b_i

triangular:
    A = [ a_11  0     0
          a_21  a_22  0
          a_31  a_32  a_33 ],  x_i = (1/a_ii) (b_i − Σ_{j<i} a_ij x_j)
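A short sketch of the triangular case (forward substitution); the 3×3 system is just an illustration:

    import numpy as np

    def forward_substitution(A, b):
        """Solve Ax = b for lower-triangular A: x_i = (b_i - sum_{j<i} A_ij x_j) / A_ii."""
        x = np.zeros(len(b))
        for i in range(len(b)):
            x[i] = (b[i] - A[i, :i] @ x[:i]) / A[i, i]
        return x

    # illustrative lower-triangular system
    A = np.array([[2.0, 0.0, 0.0],
                  [1.0, 3.0, 0.0],
                  [4.0, 5.0, 6.0]])
    b = np.array([2.0, 5.0, 32.0])
    x = forward_substitution(A, b)
    print(np.allclose(A @ x, b))  # True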
Direct methods
A is symmetric positive definite.
Cholesky factorization:
    A = L L^t,
where L is lower triangular. So x = L^{-t} L^{-1} b by
    Lz = b:      z_i = (1/L_ii) (b_i − Σ_{j<i} L_ij z_j)
    L^t x = z:   x_i = (1/L_ii) (z_i − Σ_{j>i} L_ji x_j)
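A minimal sketch of this factor-then-substitute solve using scipy's Cholesky helpers (the random SPD matrix is an illustrative assumption):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    rng = np.random.default_rng(0)
    M = rng.standard_normal((5, 5))
    A = M @ M.T + 5 * np.eye(5)        # symmetric positive definite by construction
    b = rng.standard_normal(5)

    c, low = cho_factor(A)             # A = L L^t
    x = cho_solve((c, low), b)         # two triangular solves: L z = b, L^t x = z
    print(np.allclose(A @ x, b))       # True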
[Plot: magnitude vs. iteration]
Convergence
[Plots: residual magnitude vs. iteration, log scale]
Iterative methods
Motivation: directly optimize f(x) = c − x^t b + (1/2) x^t A x.
gradient descent:
  1. r_i = b − A x_i
  2. pick α_i = (r_i^t r_i) / (r_i^t A r_i): minimizes f(x_i + α r_i)
  3. x_{i+1} = x_i + α_i r_i
Indeed,
    f(x_i + α r_i) = c − x_i^t b + (1/2) x_i^t A x_i + α r_i^t (A x_i − b) + (1/2) α² r_i^t A r_i
                   = f(x_i) − α r_i^t r_i + (1/2) α² r_i^t A r_i
(Cost of a step = 1 A-multiply.)
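A short sketch of this steepest-descent iteration on a small SPD system (the matrix, tolerance, and iteration cap are illustrative, not from the slides); the residual update keeps the cost at one A-multiply per step:

    import numpy as np

    def gradient_descent(A, b, x0, tol=1e-10, max_iter=10000):
        """Steepest descent for f(x) = c - x^t b + 0.5 x^t A x, A SPD."""
        x = x0.astype(float).copy()
        r = b - A @ x                       # residual = -gradient
        for _ in range(max_iter):
            if np.linalg.norm(r) < tol:
                break
            Ar = A @ r                      # the single A-multiply per step
            alpha = (r @ r) / (r @ Ar)      # exact minimizer of f(x + alpha r)
            x += alpha * r
            r -= alpha * Ar                 # update residual without forming A @ x
        return x

    # illustrative SPD system
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    print(gradient_descent(A, b, np.zeros(2)))   # ~ [1/11, 7/11]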
Iterative methods
Optimize f(x) = c − x^t b + (1/2) x^t A x.
conjugate gradient descent:
  1. r_i = b − A x_i
  2. d_i = r_i + β_i d_{i−1}, with β_i = −(d_{i−1}^t A r_i) / (d_{i−1}^t A d_{i−1}), which ensures d_{i−1}^t A d_i = 0
  3. pick α_i = (d_i^t r_i) / (d_i^t A d_i): minimizes f(x_i + α d_i)
Indeed,
    f(x_i + α d_i) = c − x_i^t b + (1/2) x_i^t A x_i − α d_i^t r_i + (1/2) α² d_i^t A d_i
(also means that r_{i+1}^t d_i = 0)
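A compact sketch of these three steps (illustrative SPD test system; the β update follows the slide's A-orthogonality formula, reusing A d via the symmetry of A):

    import numpy as np

    def conjugate_gradient(A, b, x0, tol=1e-10):
        """CG for f(x) = c - x^t b + 0.5 x^t A x, A SPD."""
        x = x0.astype(float).copy()
        r = b - A @ x                        # step 1: residual
        d = r.copy()
        for _ in range(len(b)):
            Ad = A @ d
            alpha = (d @ r) / (d @ Ad)       # step 3: minimizes f(x + alpha d)
            x += alpha * d
            r -= alpha * Ad
            if np.linalg.norm(r) < tol:
                break
            # step 2: beta = -(d^t A r)/(d^t A d), ensuring d^t A d_new = 0
            beta = -(Ad @ r) / (d @ Ad)
            d = r + beta * d
        return x

    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    print(conjugate_gradient(A, b, np.zeros(2)))  # exact after 2 steps: ~ [1/11, 7/11]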
A cute result
conjugate gradient descent (with x_0 = 0, so r_0 = b): after k steps, x_k lies in span{b, Ab, …, A^k b}, and it is the optimal point in that span for minimizing c − b^t x + (1/2) x^t A x (see the theorem restated below).
κ = max eig(A) / min eig(A)
For CG,
    ||r_i|| ∝ ((√κ − 1) / (√κ + 1))^i
(compare ((κ − 1) / (κ + 1))^i for gradient descent).
Preconditioning: change variables x = P^t y, so the system solved in y is better conditioned.
Class project?
One idea for a preconditioner is a block-diagonal matrix
    P^{-1} = [ L_11  0     0
               0     L_22  0
               0     0     L_33 ]
where L_ii^t L_ii = A_ii, a diagonal block of A.
In what sense does good clustering give good preconditioners?
End of solvers: there are a few other iterative solvers out there I haven't discussed.
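As an illustrative sketch of the idea, with blocks of size 1 (i.e., a Jacobi preconditioner) and a hypothetical test matrix, scipy's CG accepts such a preconditioner directly:

    import numpy as np
    from scipy.sparse.linalg import cg, LinearOperator

    rng = np.random.default_rng(1)
    M = rng.standard_normal((100, 100))
    A = M @ M.T + 100 * np.eye(100)          # SPD test matrix
    b = rng.standard_normal(100)

    # Jacobi preconditioner: the size-1 case of the block-diagonal idea.
    d = np.diag(A)
    precond = LinearOperator(A.shape, matvec=lambda v: v / d)

    x, info = cg(A, b, M=precond)
    print(info, np.linalg.norm(A @ x - b))   # info == 0 on convergence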
f continuous.
A derivative-free option:
A bracket is (a, b, c) s.t. a < b < c and f(a) > f(b) < f(c); then f(x) has a local min for a < x < c (see the bracket-shrinking sketch below).
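A minimal golden-section search sketch that shrinks such a bracket; the slide does not name the scheme, this is the standard one, and the test function and bracket are illustrative:

    import math

    def golden_section(f, a, b, c, tol=1e-8):
        """Shrink a bracket (a, b, c) with f(a) > f(b) < f(c) around a local min."""
        phi = (math.sqrt(5) - 1) / 2          # golden ratio conjugate, ~0.618
        while c - a > tol:
            # probe inside the larger of the two sub-intervals
            if c - b > b - a:
                x = b + (1 - phi) * (c - b)
                if f(x) < f(b):
                    a, b = b, x               # new bracket (b, x, c)
                else:
                    c = x                     # new bracket (a, b, x)
            else:
                x = b - (1 - phi) * (b - a)
                if f(x) < f(b):
                    c, b = b, x               # new bracket (a, x, b)
                else:
                    a = x                     # new bracket (x, b, c)
        return b

    # illustrative use: min of (x - 1)^2 bracketed by (-3, 0, 5)
    print(golden_section(lambda x: (x - 1)**2, -3.0, 0.0, 5.0))  # ~1.0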
1D optimization
Fundamentally limited accuracy of derivative-free argmin, bracketed or not: near the min, f(x) ≈ f(b) + (1/2) f''(b)(x − b)², so floating-point values of f can only locate x to about √(machine epsilon).
Newton iteration: x_{i+1} = x_i + Δx_i.
Asymptotic convergence: let e_i = x_i − x*; near x*, the error is squared each step, e_{i+1} = O(e_i²).
Sources of trouble
Non-linear Newton
Try to enforce f(x_{i+1}) ≤ f(x_i): damp the Newton step Δx_i,
    x_{i+1} = x_i + λ_i Δx_i,  λ_i ∈ (0, 1]
(a sketch follows below).
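A sketch of this safeguarded step, using simple backtracking to pick λ_i; the halving rule, tolerances, and test function are illustrative assumptions, not from the slides:

    import numpy as np

    def damped_newton(f, grad, hess, x, tol=1e-10, max_iter=100):
        """Newton with step damping to enforce f(x_{i+1}) <= f(x_i)."""
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            dx = np.linalg.solve(hess(x), -g)    # full Newton step
            lam = 1.0
            while f(x + lam * dx) > f(x) and lam > 1e-8:
                lam *= 0.5                        # backtrack: lambda in (0, 1]
            x = x + lam * dx
        return x

    # illustrative use on f(x) = x0^4 + x1^2 (min at the origin)
    f_ = lambda x: x[0]**4 + x[1]**2
    g_ = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
    h_ = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
    print(damped_newton(f_, g_, h_, np.array([2.0, 1.0])))  # near [0, 0]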
Line searching
Practicality
Iterative methods
gradient descent (non-linear version):
  1. r_i = −∇f(x_i)
  2. pick α_i by line search, or linearized: α_i = (r_i^t r_i) / (r_i^t (∇∇f) r_i)
  3. x_{i+1} = x_i + α_i r_i
Iterative methods
conjugate gradient (non-linear version): d_i = r_i + β_i d_{i−1} with
    β_i = (r_i^t (r_i − r_{i−1})) / (r_{i−1}^t r_{i−1})   (Polak-Ribiere)
    β_i = (r_i^t r_i) / (r_{i−1}^t r_{i−1})               (Fletcher-Reeves)
and step length
    linearized α_i = (r_i^t d_i) / (d_i^t (∇∇f) d_i)
A sketch of the Polak-Ribiere variant follows below.
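A sketch of the Polak-Ribiere variant, with a crude backtracking search standing in for the linearized α rule (all choices here, including the PR+ reset and the Rosenbrock test, are illustrative assumptions):

    import numpy as np

    def nonlinear_cg(f, grad, x, tol=1e-8, max_iter=2000):
        """Non-linear CG with the Polak-Ribiere beta and backtracking search."""
        r = -grad(x)
        d = r.copy()
        for _ in range(max_iter):
            if np.linalg.norm(r) < tol:
                break
            if d @ r <= 0:
                d = r.copy()              # restart if d is not a descent direction
            alpha, f0 = 1.0, f(x)
            while f(x + alpha * d) > f0 and alpha > 1e-12:
                alpha *= 0.5              # crude backtracking line search
            x = x + alpha * d
            r_new = -grad(x)
            beta = max(0.0, r_new @ (r_new - r) / (r @ r))   # Polak-Ribiere (PR+)
            d = r_new + beta * d
            r = r_new
        return x

    # illustrative use on the Rosenbrock function
    f_ = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
    g_ = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                             200 * (x[1] - x[0]**2)])
    print(nonlinear_cg(f_, g_, np.array([-1.2, 1.0])))  # should approach [1, 1]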
What else?
Remember this cute property?

Theorem (sub-optimality of CG)
(Assuming x_0 = 0) at the end of step k, the solution x_k is the optimal linear combination of b, Ab, A²b, …, A^k b for minimizing c − b^t x + (1/2) x^t A x.
Quasi-Newton
Quasi-Newton has much popularity/hype. What if we approximate (∇∇f(x))^{-1} from the data we have?
    (∇∇f(x)) (x_k − x_{k−1}) ≈ ∇f(x_k) − ∇f(x_{k−1})
BFGS update
With s_k = x_k − x_{k−1} and y_k = ∇f(x_k) − ∇f(x_{k−1}),
    H_k = (I − (s_k y_k^t)/(y_k^t s_k)) H_{k−1} (I − (y_k s_k^t)/(y_k^t s_k)) + (s_k s_k^t)/(y_k^t s_k)

Lemma
The BFGS update minimizes ||H^{-1} − H_{k−1}^{-1}||_F over H subject to H y_k = s_k.
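A direct sketch of this rank-two update on dense matrices (in practice one would use a limited-memory form; the random vectors are illustrative):

    import numpy as np

    def bfgs_update(H, s, y):
        """Apply the BFGS inverse-Hessian update: H_k from H_{k-1}, s_k, y_k."""
        rho = 1.0 / (y @ s)                 # requires the curvature condition y^t s > 0
        I = np.eye(len(s))
        V = I - rho * np.outer(s, y)        # I - s y^t / (y^t s)
        return V @ H @ V.T + rho * np.outer(s, s)

    # the updated H satisfies the secant condition H y = s
    rng = np.random.default_rng(2)
    H0, s, y = np.eye(3), rng.standard_normal(3), rng.standard_normal(3)
    if y @ s > 0:
        print(np.allclose(bfgs_update(H0, s, y) @ y, s))   # True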
Quasi-Newton
  1. Compute r_k = −∇f(x_k), y_k = r_{k−1} − r_k
  2. Compute d_k = H_k r_k
  3. Line search along d_k (e.g. linearized α_k = (r_k^t d_k) / (d_k^t (∇∇f) d_k))
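In practice one rarely hand-rolls this loop; as a usage sketch, scipy's BFGS implements the same scheme (the Rosenbrock test is an illustrative choice):

    import numpy as np
    from scipy.optimize import minimize

    f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
    g = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                            200 * (x[1] - x[0]**2)])
    res = minimize(f, np.array([-1.2, 1.0]), jac=g, method='BFGS')
    print(res.x)  # near [1, 1]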
Summary
All multi-variate optimizations relate to posdef linear solves.
Simple iterative methods require pre-conditioning to be effective in high dimensions.
Line-searching strategies are highly variable.
Timing and storage of f, ∇f, ∇∇f are all critical in selecting your method.
f         | ∇f   | concerns | method
----------|------|----------|--------------------------
fast      | fast | 2,5      | quasi-N (zero-search)
fast      | fast | 5        | CG (zero-search)
fast      | slow | 1,2,3    | derivative-free methods
fast      | slow | 2,5      | quasi-N (min-search)
fast      | slow | 3,5      | CG (min-search)
fast/slow | slow | 2,4,5    | quasi-N with Armijo
fast/slow | slow | 4,5      | CG (linearized α)

(1=time, 2=space, 3=accuracy, 4=robust vs. nonlinearity, 5=precondition)
Don't take this table too seriously…