Subgradient and Bundle Methods

for optimization of convex non-smooth functions

April 1, 2009

Motivation

Many naturally occurring problems are nonsmooth, for example:
the hinge loss
the feasible region of a convex minimization problem
piecewise linear functions

A function that merely approximates a non-smooth function may be analytically smooth, but "numerically nonsmooth".

Methods for Nonsmooth Optimization

Approximate by a series of smooth functions
Reformulate the problem, adding constraints, so that the objective is smooth
Subgradient methods
Cutting plane methods
Moreau-Yosida regularization
Bundle methods
UV-decomposition

Definition
An extension of gradients

For a convex differentiable function f(x), for all x, y:
    f(y) ≥ f(x) + ∇f(x)^T (y − x)    (1)

So, a subgradient of f at x is defined as any g ∈ R^n such that, for all y:
    f(y) ≥ f(x) + g^T (y − x)    (2)
The set of all subgradients of f at x is denoted ∂f(x).
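A concrete check of inequality (2), not from the original slides: the sketch below takes f(x) = |x| and numerically verifies the subgradient inequality at x = 0, where ∂f(0) = [−1, 1]; the helper names are ad hoc.

```python
import numpy as np

def f(x):
    """The absolute value, a standard nonsmooth convex function."""
    return abs(x)

def subgradient_holds(g, x, ys):
    """Check the subgradient inequality f(y) >= f(x) + g*(y - x) on test points."""
    return all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in ys)

ys = np.linspace(-2.0, 2.0, 401)
# At x = 0, every g in [-1, 1] is a subgradient of |x| ...
print(subgradient_holds(0.5, 0.0, ys))   # True
print(subgradient_holds(-1.0, 0.0, ys))  # True
# ... but g outside [-1, 1] violates the inequality.
print(subgradient_holds(1.5, 0.0, ys))   # False
```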

Some Facts
From Convex Analysis

A convex function is always subdifferentiable, i.e. a subgradient of a convex function exists at every point.
Directional derivatives also exist at every point.
If a convex function f is differentiable at x, its only subgradient is the gradient at that point, i.e. ∂f(x) = {∇f(x)}.
Subgradients give lower bounds for directional derivatives: f′(x; d) = sup_{g ∈ ∂f(x)} ⟨g, d⟩.
Further, d is a descent direction iff g^T d < 0 for all g ∈ ∂f(x).
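To illustrate the directional-derivative formula (my own example, not from the slides): for a piecewise-linear f(x) = max_i(a_i^T x + b_i) with all pieces active, f′(x; d) = max_i a_i^T d, which the sketch below compares against a finite-difference estimate.

```python
import numpy as np

# f(x) = max_i (a_i^T x + b_i): a simple piecewise-linear convex function.
A = np.array([[1.0, 2.0], [3.0, -1.0]])
b = np.array([0.0, 0.0])

def f(x):
    return np.max(A @ x + b)

x = np.zeros(2)          # both pieces are active here, so ∂f(x) = conv{rows of A}
d = np.array([1.0, 1.0])

# f'(x; d) = sup_{g in ∂f(x)} <g, d>; for the active affine pieces this is max_i a_i^T d.
directional = max(A @ d)

# Compare against a one-sided finite-difference estimate of the directional derivative.
t = 1e-6
print(directional, (f(x + t * d) - f(x)) / t)  # the two values should agree closely
```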

Properties
Without Proof

∂(f1 + f2)(x) = ∂f1(x) + ∂f2(x)
∂(αf)(x) = α ∂f(x), for α ≥ 0
g(x) = f(Ax + b) ⇒ ∂g(x) = A^T ∂f(Ax + b)
x is a local minimum ⇒ 0 ∈ ∂f(x)
However, for f(x) = |x| the oracle returns the subgradient 0 only at x = 0, so checking whether 0 ∈ ∂f(x) is not a practical way to find minima.
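A small illustration of the affine-composition rule, assuming f = |·| and an A, b of my own choosing; it checks that an element of A^T ∂f(Ax + b) indeed satisfies the subgradient inequality for g(x) = f(Ax + b).

```python
import numpy as np

# Chain rule for affine composition: g(x) = f(Ax + b) with f = |.| (1-D here),
# so a subgradient of g at x is A^T s where s ∈ ∂|.|(Ax + b).
A = np.array([[2.0, -1.0]])
b = np.array([0.5])

def g(x):
    return abs(A @ x + b)[0]

x = np.array([1.0, 0.0])
s = np.sign(A @ x + b)            # subgradient of |.| away from the kink
sub_g = (A.T @ s).ravel()         # an element of A^T ∂f(Ax + b)

# Check the subgradient inequality g(y) >= g(x) + sub_g^T (y - x) at random points.
rng = np.random.default_rng(0)
ys = rng.normal(size=(1000, 2))
gaps = np.array([g(y) - (g(x) + sub_g @ (y - x)) for y in ys])
print(gaps.min() >= -1e-10)       # True: the inequality holds everywhere
```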

Subgradient Method
Algorithm

The subgradient method is NOT a descent method!
    x^(k+1) = x^(k) − α_k g^(k), with α_k ≥ 0 and g^(k) ∈ ∂f(x^(k))
    f_best^(k) = min{ f_best^(k−1), f(x^(k)) }
Line search is not performed; the step lengths α_k are usually fixed ahead of time.
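A minimal sketch of the update above; the test objective ||x||_1 and the diminishing step-size schedule are my own choices, not from the slides.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, alpha, iters=500):
    """Plain subgradient method: x <- x - alpha_k * g, tracking the best value seen."""
    x = np.asarray(x0, dtype=float)
    f_best, x_best = f(x), x.copy()
    for k in range(1, iters + 1):
        g = subgrad(x)
        x = x - alpha(k) * g            # not a descent step in general
        if f(x) < f_best:               # so the best iterate is kept separately
            f_best, x_best = f(x), x.copy()
    return x_best, f_best

f = lambda x: np.sum(np.abs(x))          # example objective: the l1 norm
subgrad = lambda x: np.sign(x)           # a valid subgradient of the l1 norm
alpha = lambda k: 0.1 / np.sqrt(k)       # nonsummable diminishing step size

x_best, f_best = subgradient_method(f, subgrad, [3.0, -2.0], alpha)
print(x_best, f_best)                    # converges toward the minimizer x = 0
```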

Step Lengths

Commonly used step lengths:
Constant step size: α_k = α
Constant step length: α_k = γ / ||g^(k)||_2 (so that ||x^(k+1) − x^(k)||_2 = γ)
Square summable but not summable step size: α_k ≥ 0, Σ_{k=1}^∞ α_k^2 < ∞, Σ_{k=1}^∞ α_k = ∞
Nonsummable diminishing: α_k ≥ 0, lim_{k→∞} α_k = 0, Σ_{k=1}^∞ α_k = ∞
Nonsummable diminishing step lengths: α_k = γ_k / ||g^(k)||_2 with γ_k ≥ 0, lim_{k→∞} γ_k = 0, Σ_{k=1}^∞ γ_k = ∞
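As a rough illustration, the step-length rules above can be written as small functions; all names here are hypothetical, chosen only for this sketch.

```python
import numpy as np

# Hypothetical helpers illustrating the step-length rules above.
def constant_step_size(alpha):
    return lambda k, g: alpha

def constant_step_length(gamma):
    return lambda k, g: gamma / np.linalg.norm(g)   # ||x_{k+1} - x_k|| = gamma

def square_summable_not_summable(a=1.0):
    return lambda k, g: a / k                       # sum a/k diverges, sum (a/k)^2 converges

def nonsummable_diminishing(a=1.0):
    return lambda k, g: a / np.sqrt(k)              # -> 0 but not summable

# Example: evaluate each rule at iteration k = 10 for a subgradient g.
g = np.array([3.0, 4.0])
for rule in (constant_step_size(0.1), constant_step_length(0.1),
             square_summable_not_summable(), nonsummable_diminishing()):
    print(rule(10, g))
```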

Convergence Result

Assume that there exists G such that the norm of the subgradients is bounded, i.e. ||g^(k)||_2 ≤ G (for example, if f is Lipschitz continuous).

Result:
    f_best^(k) − f* ≤ ( dist(x^(1), X*)^2 + G^2 Σ_{i=1}^k α_i^2 ) / ( 2 Σ_{i=1}^k α_i )

The proof works by showing that ||x^(k) − x*|| decreases.
Subgradient and Bundle Methods

Convergence Result

Assume that ∃G such that the norm of the subgradients is bounded i.e. ||g (k ) ||2 ≤ G (For example, Suppose f is Lipshitz continuous) Result 2 k αi i=1 Proof is through proving ||x − x ∗ || decreases
k fbest

f∗

dist x (1) , X ∗

2

+ G2

k 2 i=1 αi

Subgradient and Bundle Methods

Convergence Result

Assume that ∃G such that the norm of the subgradients is bounded i.e. ||g (k ) ||2 ≤ G (For example, Suppose f is Lipshitz continuous) Result 2 k αi i=1 Proof is through proving ||x − x ∗ || decreases
k fbest

f∗

dist x (1) , X ∗

2

+ G2

k 2 i=1 αi

Subgradient and Bundle Methods

Convergence Result

Assume that ∃G such that the norm of the subgradients is bounded i.e. ||g (k ) ||2 ≤ G (For example, Suppose f is Lipshitz continuous) Result 2 k αi i=1 Proof is through proving ||x − x ∗ || decreases
k fbest

f∗

dist x (1) , X ∗

2

+ G2

k 2 i=1 αi

Subgradient and Bundle Methods

Convergence for Commonly used Step lengths
Constant step size: f_best^(k) converges to within G^2 h / 2 of optimal
Constant step length: f_best^(k) converges to within G h of optimal
Square summable but not summable step size: f_best^(k) → f*
Nonsummable diminishing: f_best^(k) → f*
Nonsummable diminishing step lengths: f_best^(k) → f*

    f_best^(k) − f* ≤ ( R^2 + G^2 Σ_{i=1}^k α_i^2 ) / ( 2 Σ_{i=1}^k α_i )

So the optimal α_i are R / (G √k), and the method converges to accuracy ε in (RG/ε)^2 steps.
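A toy calculation of the last statement, with made-up values of R, G and the target accuracy ε:

```python
import math

# Illustration of the final bound (my own toy numbers): to reach accuracy eps,
# the optimal fixed schedule is alpha_i = R / (G * sqrt(k)) with k ~ (R*G/eps)^2 steps.
R, G, eps = 10.0, 5.0, 0.1        # dist to optimum, subgradient bound, target accuracy
k = math.ceil((R * G / eps) ** 2)
alpha = R / (G * math.sqrt(k))
print(k, alpha)                    # 250000 iterations, step size 0.004
```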

Variations

Optimal step size when f* is known (e.g. the optimal value is known to be 0, but the minimizer is not):
    α_k = ( f(x^(k)) − f* ) / ||g^(k)||_2^2
Projected subgradient, for minimizing f(x) s.t. x ∈ C:
    x^(k+1) = P( x^(k) − α_k g^(k) ), where P is the projection onto C
Alternating projections: find a point in the intersection of 2 convex sets
Heavy ball method: x^(k+1) = x^(k) − α_k g^(k) + β_k ( x^(k) − x^(k−1) )
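A sketch of the projected subgradient variant on a box-constrained l1 problem; the problem data and the 1/k schedule are my own choices, not from the slides.

```python
import numpy as np

# Projected subgradient sketch: minimize f(x) = ||x - c||_1 subject to x in the box [0, 1]^n.
c = np.array([1.5, -0.5, 0.3])

f = lambda x: np.sum(np.abs(x - c))
subgrad = lambda x: np.sign(x - c)
project = lambda x: np.clip(x, 0.0, 1.0)     # Euclidean projection onto the box

x = np.zeros_like(c)
f_best, x_best = f(x), x.copy()
for k in range(1, 2001):
    alpha = 1.0 / k                          # square summable but not summable
    x = project(x - alpha * subgrad(x))      # subgradient step, then project back onto C
    if f(x) < f_best:
        f_best, x_best = f(x), x.copy()
print(x_best, f_best)                        # approaches the constrained minimizer [1, 0, 0.3]
```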

Pros
Can be applied immediately to a wide variety of problems, especially when the required accuracy is not very high
Low memory usage
Often possible to design distributed methods when the objective is decomposable

Cons
Slower than second-order methods

Cutting Plane Method

Again, consider the problem: minimize f(x) subject to x ∈ C
Construct an approximate (cutting-plane) model:
    f̂(x) = max_{i ∈ I} ( f(x_i) + g_i^T (x − x_i) )
Minimize the model over x, then evaluate f(x) and a subgradient g at the minimizer
Update the model and repeat until the desired accuracy is reached
Numerically unstable
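A minimal sketch of this cutting-plane loop (Kelley's method) on f(x) = ||x||_1 over a box, using an LP solver for the model minimization; the example problem, box C, and tolerance are my own assumptions.

```python
import numpy as np
from scipy.optimize import linprog

# Cutting-plane sketch on f(x) = ||x||_1 over the box C = [-2, 2]^2.
f = lambda x: np.sum(np.abs(x))
subgrad = lambda x: np.sign(x)

n, box = 2, (-2.0, 2.0)
cuts = []                                   # list of (x_i, f(x_i), g_i)
x = np.array([1.5, -1.0])

for it in range(30):
    cuts.append((x.copy(), f(x), subgrad(x)))
    # Minimize the model: min_{x in C, t} t  s.t.  t >= f(x_i) + g_i^T (x - x_i) for all cuts.
    A_ub = np.array([np.append(g, -1.0) for _, _, g in cuts])
    b_ub = np.array([g @ xi - fi for xi, fi, g in cuts])
    c = np.append(np.zeros(n), 1.0)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[box] * n + [(None, None)])
    x, model_val = res.x[:n], res.x[n]
    if f(x) - model_val < 1e-6:             # gap between the true value and the model value
        break

print(x, f(x))                              # approaches the minimizer x = 0
```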

Moreau-Yosida Regularization

Idea: solve a series of smooth convex problems to minimize f(x)
    F(x) = min_{y ∈ R^n} f(y) + (λ/2) ||y − x||^2
    p(x) = argmin_{y ∈ R^n} f(y) + (λ/2) ||y − x||^2
F(x) is differentiable! ∇F(x) = λ( x − p(x) )
Minimization is done using the dual.
Cutting Plane Method + Moreau-Yosida Regularization = Bundle Methods
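For intuition (my example, not from the slides): for f(x) = |x| the proximal point p(x) has a closed form (soft-thresholding), and one can check numerically that ∇F(x) = λ(x − p(x)).

```python
import numpy as np

# Moreau-Yosida regularization of f(x) = |x|: p(x) is soft-thresholding with threshold 1/lambda,
# and F is smooth with gradient lam * (x - p(x)).
lam = 2.0

def prox_abs(x, lam):
    """p(x) = argmin_y |y| + (lam/2)(y - x)^2  (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / lam, 0.0)

def F(x, lam):
    p = prox_abs(x, lam)
    return np.abs(p) + 0.5 * lam * (p - x) ** 2

x = 1.5
p = prox_abs(x, lam)
grad_F = lam * (x - p)                      # the gradient formula from the slide

# Compare with a centered finite difference of F: the two should match closely.
h = 1e-6
print(grad_F, (F(x + h, lam) - F(x - h, lam)) / (2 * h))
```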

Elementary Bundle Method

As before, f is assumed to be Lipschitz continuous.
At a generic iteration we maintain a "bundle" of tuples ⟨ y_i, f(y_i), s_i, α_i ⟩: trial points y_i, their function values, subgradients s_i ∈ ∂f(y_i), and (typically) the associated linearization errors α_i.

Elementary Bundle Method

Follow the Cutting Plane Method, but use M-Y regularization when minimizing the model:
    y^(k+1) = argmin_{y ∈ R^n} f̂_k(y) + (μ_k/2) ||y − x̂^k||^2
    δ_k = f(x̂^k) − [ f̂_k(y^(k+1)) + (μ_k/2) ||y^(k+1) − x̂^k||^2 ] ≥ 0
    if δ_k < δ, stop
    If f(x̂^k) − f(y^(k+1)) ≥ m δ_k:  Serious Step  x̂^(k+1) = y^(k+1)
    else:                            Null Step     x̂^(k+1) = x̂^k
    f̂_{k+1}(y) = max{ f̂_k(y), f(y^(k+1)) + ⟨ s_{k+1}, y − y^(k+1) ⟩ }
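A rough sketch of this elementary bundle loop on f(x) = ||x||_1; the proximal subproblem is solved here through its epigraph form with a generic constrained solver, and the parameters μ_k, m, and the stopping tolerance are my own choices rather than anything prescribed by the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Elementary (proximal) bundle method sketch on f(x) = ||x||_1.
f = lambda x: np.sum(np.abs(x))
subgrad = lambda x: np.sign(x)

n, mu, m, tol = 2, 1.0, 0.5, 1e-6
xhat = np.array([2.0, -1.5])
bundle = [(xhat.copy(), f(xhat), subgrad(xhat))]      # tuples (y_i, f(y_i), s_i)

def prox_step(bundle, xhat):
    """Solve min_y  fhat_k(y) + (mu/2)||y - xhat||^2 via its epigraph form."""
    def obj(z):                                       # z = (y, t)
        return z[n] + 0.5 * mu * np.sum((z[:n] - xhat) ** 2)
    cons = [{'type': 'ineq',
             'fun': (lambda z, yi=yi, fi=fi, si=si: z[n] - (fi + si @ (z[:n] - yi)))}
            for yi, fi, si in bundle]
    z0 = np.append(xhat, f(xhat))
    res = minimize(obj, z0, constraints=cons, method='SLSQP')
    return res.x[:n], res.x[n]                        # y_{k+1}, fhat_k(y_{k+1})

for k in range(50):
    y_new, model_val = prox_step(bundle, xhat)
    delta = f(xhat) - (model_val + 0.5 * mu * np.sum((y_new - xhat) ** 2))
    if delta < tol:
        break
    if f(xhat) - f(y_new) >= m * delta:               # serious step: move the center
        xhat = y_new
    bundle.append((y_new, f(y_new), subgrad(y_new)))  # a null step only enriches the model

print(xhat, f(xhat))                                  # approaches the minimizer x = 0
```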

Convergence

The algorithm either makes a finite number of Serious Steps and then only makes Null Steps:
    then, if k0 is the last Serious Step and μ_k is nondecreasing, δ_k → 0;
or it makes an infinite number of Serious Steps:
    then Σ_{k ∈ Ks} δ_k ≤ ( f(x̂^0) − f* ) / m, so δ_k → 0.

Variations

Replace ||y − x||^2 by (y − x)^T M_k (y − x): the regularization is still differentiable
Conjugate Gradient methods are obtained as a slight modification of the algorithm (see [5])
Variable Metric Methods [10]
M_k = u_k I gives Diagonal Variable Metric Methods
Bundle-Newton Methods

Summary

Nonsmooth convex optimization has been explored since the 1960s; the original subgradient methods were introduced by Naum Shor, and bundle methods have been developed more recently.
Subgradient methods are simple but slow; their predominant current application is in distributed settings.
Bundle methods solve a bounded QP per iteration, which is slow, but they need fewer iterations; they are preferred for applications where the oracle cost is high.

For Further Reading I

Naum Z. Shor. Minimization Methods for Non-Differentiable Functions. Springer-Verlag, 1985.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press.
A. Ruszczyński. Nonlinear Optimization. Princeton University Press.
Wikipedia: en.wikipedia.org/wiki/Subgradient_method

For Further Reading II

Marko Mäkelä. Survey of Bundle Methods, 2009. http://www.informaworld.com/smpp/content~db=all~content=a713741700
Alexandre Belloni. An Introduction to Bundle Methods. http://web.mit.edu/belloni/www/LecturesIntroBundle.pdf
John E. Mitchell. Cutting Plane and Subgradient Methods, 2005. http://www.optimization-online.org/DB_HTML/2009/05/2298.html

For Further Reading III
Stephen Boyd. Lecture Notes on Subgradient Methods. http://www.stanford.edu/class/ee392o/subgrad_method.pdf
Alexander J. Smola, S. V. N. Vishwanathan, Quoc V. Le. Bundle Methods for Machine Learning, 2007. http://books.nips.cc/papers/files/nips20/NIPS2007_0470.pdf
C. Lemaréchal. Variable Metric Bundle Methods, 1997. http://www.springerlink.com/index/3515WK428153171N.pdf
Quoc Le, Alexander Smola. Direct Optimization of Ranking Measures, 2007. http://arxiv.org/abs/0704.3359

For Further Reading IV

S. V. N. Vishwanathan, A. Smola. Quasi-Newton Methods for Efficient Large-Scale Machine Learning. http://portal.acm.org/ft_gateway.cfm?id=1390309&type=pdf and www.stat.purdue.edu/~vishy/talks/LBFGS.pdf
