for optimization of convex non-smooth functions

April 1, 2009

Motivation

Many naturally occurring problems are nonsmooth:
- Hinge loss
- Feasible region of a convex minimization problem
- Piecewise linear functions

A function that approximates a non-smooth function may be analytically smooth, but "numerically nonsmooth".

Methods for nonsmooth optimization

- Approximate by a series of smooth functions
- Reformulate the problem, adding more constraints such that the objective is smooth
- Subgradient methods
- Cutting plane methods
- Moreau-Yosida regularization
- Bundle methods
- U-V decomposition

Definition

For a convex differentiable function f(x), ∀x, y:

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)    (1)

So, a subgradient of f at x is defined as any g ∈ Rⁿ such that, ∀y,

    f(y) ≥ f(x) + gᵀ(y − x)    (2)

The set of all subgradients of f at x is denoted ∂f(x).
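The subgradient inequality (2) can be checked directly in code. A minimal Python sketch (not from the slides; all names are illustrative), using f(x) = |x|, whose subdifferential is {sign(x)} for x ≠ 0 and the interval [−1, 1] at 0:

```python
# Verify f(y) >= f(x) + g*(y - x) for f(x) = |x| and a valid choice of g.

def f(x):
    return abs(x)

def a_subgradient(x):
    # any g in [-1, 1] is a valid subgradient at 0; we return 0 there
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

points = [-2.0, -0.5, 0.0, 0.3, 1.7]
for x in points:
    g = a_subgradient(x)
    # the affine minorant through (x, f(x)) stays below f everywhere
    assert all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in points)
```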

Some Facts
From Convex Analysis

- A convex function is always subdifferentiable, i.e. the subgradient of a convex function exists at every point. Directional derivatives also exist at every point.
- If a convex function f is differentiable at x, its only subgradient there is the gradient, i.e. ∂f(x) = {∇f(x)}.
- Subgradients are lower bounds for directional derivatives: f′(x; d) = sup_{g∈∂f(x)} ⟨g, d⟩.
- Further, d is a descent direction iff gᵀd < 0 ∀g ∈ ∂f(x).
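The identity f′(x; d) = sup_{g∈∂f(x)} ⟨g, d⟩ can be illustrated numerically. A small Python sketch (illustrative, not from the slides) for f(x) = |x| at x = 0, where ∂f(0) = [−1, 1] and the sup is attained at an endpoint of that interval:

```python
# Compare a one-sided difference quotient for f'(0; d) with
# sup over the subdifferential [-1, 1] of g*d, which equals |d|.

def dir_deriv(f, x, d, t=1e-8):
    # one-sided directional derivative approximation
    return (f(x + t * d) - f(x)) / t

f = abs
for d in (-3.0, -1.0, 0.5, 2.0):
    # sup of g*d over [-1, 1] is attained at an extreme point
    sup_gd = max(g * d for g in (-1.0, 1.0))
    assert abs(dir_deriv(f, 0.0, d) - sup_gd) < 1e-6
# Note: no descent direction exists at 0, since for every d != 0
# some g in [-1, 1] has g*d > 0 -- consistent with 0 being the minimizer.
```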

Properties
Without Proof

- ∂(f₁ + f₂)(x) = ∂f₁(x) + ∂f₂(x)
- ∂(αf)(x) = α∂f(x) for α > 0
- g(x) = f(Ax + b) ⇒ ∂g(x) = Aᵀ∂f(Ax + b)
- x is a local minimum ⇒ 0 ∈ ∂f(x)
- However, for f(x) = |x|, the oracle returns the subgradient 0 only at x = 0, so waiting for a zero subgradient is not a good way to find minima.
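The affine-composition rule ∂g(x) = Aᵀ∂f(Ax + b) can be sanity-checked in one dimension. A Python sketch under illustrative choices (f = |·|, A = a = 2, b = −1, none of which come from the slides):

```python
# Check that a * (subgradient of f at a*x + b) is a subgradient of
# g(x) = f(a*x + b), via the subgradient inequality.

a, b = 2.0, -1.0
f = abs

def g(x):
    return f(a * x + b)

def f_subgrad(z):
    return 1.0 if z > 0 else (-1.0 if z < 0 else 0.0)

def g_subgrad(x):
    # chain rule for affine composition: A^T times a subgradient of f
    return a * f_subgrad(a * x + b)

pts = [-1.0, 0.0, 0.5, 2.0]
for x in pts:
    s = g_subgrad(x)
    assert all(g(y) >= g(x) + s * (y - x) - 1e-12 for y in pts)
```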

Algorithm

The Subgradient Method is NOT a descent method!

    x^(k+1) = x^(k) − α_k g^(k),  for α_k ≥ 0 and g^(k) ∈ ∂f(x^(k))

    f_best^(k) = min{ f_best^(k−1), f(x^(k)) }

Line search is not performed; the step lengths α_k are usually fixed ahead of time.
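The update above is short enough to write out in full. A minimal Python sketch (illustrative; the slides prescribe no particular code), using a nonsummable diminishing step α_k = 1/k on f(x) = |x − 3|, and tracking f_best precisely because the method is not a descent method:

```python
# Subgradient method x^{k+1} = x^k - alpha_k g^k with alpha_k = 1/k.

def subgradient_method(f, subgrad, x0, steps=2000):
    x, f_best = x0, f(x0)
    for k in range(1, steps + 1):
        g = subgrad(x)
        x = x - (1.0 / k) * g          # nonsummable diminishing step size
        f_best = min(f_best, f(x))     # best value seen so far
    return x, f_best

f = lambda x: abs(x - 3.0)
sg = lambda x: 1.0 if x > 3.0 else (-1.0 if x < 3.0 else 0.0)
x, f_best = subgradient_method(f, sg, x0=10.0)
assert f_best < 1e-2   # slow convergence toward f* = 0
```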

Step Lengths

Commonly used step lengths:

- Constant step size: α_k = α
- Constant step length: α_k = γ/||g^(k)||₂ (so ||x^(k+1) − x^(k)||₂ = γ)
- Square summable but not summable step size: α_k ≥ 0, Σ_{k=1}^∞ α_k² < ∞, Σ_{k=1}^∞ α_k = ∞
- Nonsummable diminishing: α_k ≥ 0, lim_{k→∞} α_k = 0, Σ_{k=1}^∞ α_k = ∞
- Nonsummable diminishing step lengths: α_k = γ_k/||g^(k)||₂ with γ_k ≥ 0, lim_{k→∞} γ_k = 0, Σ_{k=1}^∞ γ_k = ∞
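As a concrete instance, α_k = 1/k satisfies the square-summable-but-not-summable conditions: Σα_k² converges (to π²/6) while Σα_k diverges like log k. A quick numerical illustration in Python (illustrative, not from the slides):

```python
import math

# Partial sums of alpha_k = 1/k and alpha_k^2 = 1/k^2.
def partial_sums(n):
    s = sq = 0.0
    for k in range(1, n + 1):
        s += 1.0 / k
        sq += 1.0 / k ** 2
    return s, sq

s1, sq1 = partial_sums(10_000)
s2, sq2 = partial_sums(100_000)
assert sq2 < math.pi ** 2 / 6   # squared sums stay bounded
assert s2 - s1 > 2.0            # plain sums keep growing (ln 10 ~ 2.3)
```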

Convergence Result

Assume that ∃G such that the norm of the subgradients is bounded, i.e. ||g^(k)||₂ ≤ G (for example, when f is Lipschitz continuous).

Result:

    f_best^(k) − f* ≤ ( dist(x^(1), X*)² + G² Σ_{i=1}^k α_i² ) / ( 2 Σ_{i=1}^k α_i )

The proof works by showing that ||x − x*|| decreases.

Convergence for Commonly used Step lengths
- Constant step size: f_best^(k) converges to within G²h/2 of optimal
- Constant step length: f_best^(k) converges to within Gh of optimal
- Square summable but not summable step size: f_best^(k) → f*
- Nonsummable diminishing: f_best^(k) → f*
- Nonsummable diminishing step lengths: f_best^(k) → f*

In general,

    f_best^(k) − f* ≤ ( R² + G² Σ_{i=1}^k α_i² ) / ( 2 Σ_{i=1}^k α_i )

So, the optimal α_i are R/(G√k), and the method converges to within ε in (RG/ε)² steps.
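For a constant step α_i = α and k fixed, the stated optimum follows by minimizing the bound over α:

```latex
f_{\mathrm{best}}^{(k)} - f^\ast \;\le\; \frac{R^2 + G^2 k \alpha^2}{2 k \alpha},
\qquad
\frac{d}{d\alpha}\!\left(\frac{R^2 + G^2 k \alpha^2}{2 k \alpha}\right) = 0
\;\Rightarrow\;
\alpha = \frac{R}{G\sqrt{k}},
\qquad
f_{\mathrm{best}}^{(k)} - f^\ast \;\le\; \frac{RG}{\sqrt{k}}.
```

So reaching accuracy ε requires k ≥ (RG/ε)² iterations.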

Variations

- If the optimal value f* is known (e.g. it is known to be 0, but the optimal point is not), use the Polyak step length α_k = ( f(x^(k)) − f* ) / ||g^(k)||₂²
- Projected subgradient, for minimize f(x) s.t. x ∈ C: x^(k+1) = P( x^(k) − α_k g^(k) )
- Alternating projections: find a point in the intersection of 2 convex sets
- Heavy ball method: x^(k+1) = x^(k) − α_k g^(k) + β_k ( x^(k) − x^(k−1) )
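The first two variations combine naturally. A Python sketch under illustrative assumptions (f(x) = |x|, C = [2, 5] so the projection is clipping, and f* = f(2) = 2 is known; none of these choices come from the slides):

```python
# Projected subgradient with the Polyak step alpha_k = (f(x^k) - f*)/||g^k||^2.

def project(x, lo=2.0, hi=5.0):
    # projection onto the interval C = [lo, hi] is clipping
    return min(max(x, lo), hi)

def polyak_projected(f, subgrad, f_star, x0, steps=100):
    x = x0
    for _ in range(steps):
        g = subgrad(x)
        if g == 0.0:
            break
        alpha = (f(x) - f_star) / g ** 2   # Polyak step (needs known f*)
        x = project(x - alpha * g)         # subgradient step, then project
    return x

f = abs
sg = lambda x: 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
x = polyak_projected(f, sg, f_star=2.0, x0=5.0)
assert abs(x - 2.0) < 1e-6
```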

Pros
- Can immediately be applied to a wide variety of problems, especially when the required accuracy is not very high
- Low memory usage
- Often possible to design distributed methods if the objective is decomposable

Cons
Slower than second-order methods

Cutting Plane Method

Again, consider the problem: minimize f(x) subject to x ∈ C.

- Construct an approximate model: f̂(x) = max_{i∈I} ( f(x_i) + g_iᵀ(x − x_i) )
- Minimize the model over x, then evaluate f and a subgradient g at the minimizer
- Update the model and repeat until the desired accuracy is reached
- Numerically unstable
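In one dimension the model minimization needs no LP solver: the minimizer of a max of affine cuts over an interval lies at an endpoint or at an intersection of two cuts, so candidates can be enumerated. A Python sketch of the loop under these illustrative assumptions (f(x) = |x| on C = [−4, 4]):

```python
# 1-D cutting-plane method: accumulate tangent cuts, minimize their max.

def cutting_plane(f, subgrad, lo, hi, x0, iters=30):
    cuts = []                                  # list of (value, slope, point)
    x = x0
    for _ in range(iters):
        cuts.append((f(x), subgrad(x), x))
        def model(y):
            return max(fv + g * (y - xv) for fv, g, xv in cuts)
        # candidate minimizers: endpoints and pairwise cut intersections
        cands = [lo, hi]
        for i in range(len(cuts)):
            for j in range(i + 1, len(cuts)):
                f1, g1, x1 = cuts[i]
                f2, g2, x2 = cuts[j]
                if g1 != g2:   # parallel cuts never intersect
                    y = ((f2 - g2 * x2) - (f1 - g1 * x1)) / (g1 - g2)
                    if lo <= y <= hi:
                        cands.append(y)
        x = min(cands, key=model)              # next query point
    return x

f = abs
sg = lambda x: 1.0 if x >= 0 else -1.0
x = cutting_plane(f, sg, -4.0, 4.0, x0=3.0)
assert abs(x) < 1e-6
```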

Moreau-Yosida Regularization

Idea: solve a series of smooth convex problems to minimize f(x).

    F(x) = min_{y∈Rⁿ} f(y) + (λ/2)||y − x||²
    p(x) = argmin_{y∈Rⁿ} f(y) + (λ/2)||y − x||²

F(x) is differentiable! ∇F(x) = λ( x − p(x) )

Minimization is done using the dual.

Cutting Plane Method + Moreau-Yosida Regularization = Bundle Methods
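For f(x) = |x| everything is closed-form: p(x) is soft-thresholding and the envelope F is the smooth Huber function. A Python sketch (illustrative names, not from the slides) that checks ∇F(x) = λ(x − p(x)) against a numerical derivative:

```python
# Moreau-Yosida regularization of f(x) = |x|.

def prox_abs(x, lam):
    # argmin_y |y| + (lam/2)(y - x)^2  is soft-thresholding at 1/lam
    t = 1.0 / lam
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def envelope(x, lam):
    # F(x) = f(p(x)) + (lam/2)(p(x) - x)^2, the (smooth) Huber function
    p = prox_abs(x, lam)
    return abs(p) + 0.5 * lam * (p - x) ** 2

lam, h = 2.0, 1e-6
for x in (-1.5, -0.2, 0.1, 3.0):
    grad = lam * (x - prox_abs(x, lam))       # F'(x) = lam*(x - p(x))
    numeric = (envelope(x + h, lam) - envelope(x - h, lam)) / (2 * h)
    assert abs(grad - numeric) < 1e-4
```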

Elementary Bundle Method

As before, f is assumed to be Lipschitz continuous. At a generic iteration we maintain a "bundle" ⟨y_i, f(y_i), s_i, α_i⟩ of trial points, function values, subgradients, and linearization errors.

Elementary Bundle Method

Follow the Cutting Plane Method, but use M-Y regularization when minimizing the model f̂_k:

    y^(k+1) = argmin_{y∈Rⁿ} f̂_k(y) + (µ_k/2)||y − x̂^(k)||²

    δ_k = f(x̂^(k)) − [ f̂_k(y^(k+1)) + (µ_k/2)||y^(k+1) − x̂^(k)||² ] ≥ 0

- If δ_k < δ̄, stop
- If f(x̂^(k)) − f(y^(k+1)) ≥ m δ_k: Serious Step, x̂^(k+1) = y^(k+1)
- Else: Null Step, x̂^(k+1) = x̂^(k)
- Update the model: f̂_{k+1}(y) = max{ f̂_k(y), f(y^(k+1)) + ⟨ s^(k+1), y − y^(k+1) ⟩ }
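A minimal 1-D rendition of this loop, with the convex subproblem solved by ternary search instead of a QP (an illustrative sketch under that simplification, not the method as usually implemented; f(x) = |x| and all names are made up for the example):

```python
# Elementary bundle method in 1-D: model + proximal term, serious/null steps.

def ternary_min(phi, lo, hi, iters=200):
    # valid because phi is convex in 1-D
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if phi(a) < phi(b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2

def bundle(f, subgrad, x0, mu=1.0, m=0.5, tol=1e-6, iters=100):
    x = x0
    cuts = [(f(x0), subgrad(x0), x0)]          # (value, slope, point) triples
    for _ in range(iters):
        model = lambda y: max(fv + g * (y - xv) for fv, g, xv in cuts)
        phi = lambda y: model(y) + 0.5 * mu * (y - x) ** 2
        y = ternary_min(phi, x - 100.0, x + 100.0)
        delta = f(x) - phi(y)                  # predicted decrease
        if delta < tol:
            break
        if f(x) - f(y) >= m * delta:           # serious step: move the center
            x = y
        cuts.append((f(y), subgrad(y), y))     # null step only enriches model
    return x

f = abs
sg = lambda x: 1.0 if x >= 0 else -1.0
x = bundle(f, sg, x0=5.0)
assert abs(x) < 1e-3
```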

Convergence

The algorithm either makes a finite number of Serious Steps and then only makes Null Steps:
- Then, if k₀ is the last Serious Step and µ_k is nondecreasing, δ_k → 0

Or it makes an infinite number of Serious Steps:
- Then Σ_{k∈Ks} δ_k ≤ ( f(x̂^(0)) − f* ) / m, so δ_k → 0

Variations

- Replace ||y − x||² by (y − x)ᵀ M_k (y − x): still differentiable
- Conjugate gradient methods are obtained as a slight modification of the algorithm (refer to [5])
- Variable metric methods [10]; M_k = u_k I for diagonal variable metric methods
- Bundle-Newton methods

Summary

- Nonsmooth convex optimization has been explored since the 1960s. The original subgradient methods were introduced by Naum Shor; bundle methods have been developed more recently.
- Subgradient methods are simple but slow, unless distributed, which is the predominant current application.
- Bundle methods solve a bounded QP per iteration, which is slow, but they need fewer iterations. They are preferred for applications where the oracle cost is high.

References

Naum Z. Shor, Minimization Methods for Non-differentiable Functions, Springer-Verlag, 1985.

S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press.

A. Ruszczynski, Nonlinear Optimization, Princeton University Press.

Wikipedia, en.wikipedia.org/wiki/Subgradient_method

Marko Makela, Survey of Bundle Methods, 2009. http://www.informaworld.com/smpp/content~db=all~content=a713741700

Alexandre Belloni, An Introduction to Bundle Methods. http://web.mit.edu/belloni/www/LecturesIntroBundle.pdf

John E. Mitchell, Cutting Plane and Subgradient Methods, 2005. http://www.optimization-online.org/DB_HTML/2009/05/2298.html