
# Three strategies to derive a dual problem

Ryota Tomioka

May 18, 2010

There are three strategies to derive a dual problem, namely via (i) equality constraints, (ii) conic constraints, and (iii) Fenchel's duality.

Using a group-lasso regularized support vector machine (=MKL) problem as an example, we see how these strategies can be used to derive dual problems that look different but are actually equivalent.

More specifically, we are interested in a dual of the following problem:

$$
(P)\quad \mathop{\mathrm{minimize}}_{w\in\mathbb{R}^n}\ \sum_{i=1}^m \ell_H\bigl(x^{(i)\top}w,\, y^{(i)}\bigr) + \lambda \sum_{g\in G} \|w_g\|_2,
$$

where $\{x^{(i)}, y^{(i)}\}_{i=1}^m$ ($x^{(i)}\in\mathbb{R}^n$) are training examples, $\ell_H(z_i, y_i) := (1 - y_i z_i)_+$ is the hinge loss function, and $G$ is a disjoint partition of $\{1, 2, \ldots, n\}$; i.e., $\bigcup_{g\in G} g = \{1, 2, \ldots, n\}$, and $g_1, g_2 \in G$ and $g_1 \neq g_2$ imply $g_1 \cap g_2 = \emptyset$.
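As a concrete reference point, the primal objective of (P) can be evaluated directly with NumPy. This is a minimal sketch; the function name, the toy data, and the list-of-index-arrays encoding of $G$ are illustrative choices, not from the note.

```python
import numpy as np

def primal_objective(X, y, w, groups, lam):
    """Objective of (P): sum_i hinge(x_i^T w, y_i) + lam * sum_g ||w_g||_2.

    X is the (m, n) matrix of training examples, y the (m,) label vector
    in {-1, +1}, and groups a list of index lists forming a disjoint
    partition of {0, ..., n-1}.
    """
    margins = y * (X @ w)                            # z_i = y^(i) x^(i)T w
    hinge = np.maximum(0.0, 1.0 - margins).sum()     # sum_i (1 - z_i)_+
    group_norm = sum(np.linalg.norm(w[g]) for g in groups)
    return hinge + lam * group_norm

# Two examples in R^2, each coordinate its own group.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.array([2.0, -2.0])
print(primal_objective(X, y, w, groups=[[0], [1]], lam=0.5))  # 0 + 0.5*(2+2) = 2.0
```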

## 1 Using equality constraints

The most basic technique in deriving a dual problem can be summarized as follows:

1. Find an equality constraint.
2. If you cannot find an equality constraint, introduce auxiliary variables to create an equality constraint.
3. Form a Lagrangian. Introduce Lagrangian multipliers for every equality constraint.
4. Try to minimize the Lagrangian with respect to the primal variables.
5. If the minimization is too hard, introduce more auxiliary variables, and go back to 3.
6. If you can minimize the Lagrangian, check when it takes a finite value and when it becomes $-\infty$. This will give you the dual constraints.


Following the above recipe, we first notice that there is no equality constraint in the primal problem (P). Thus we introduce an auxiliary variable $z\in\mathbb{R}^m$ and rewrite (P) as follows:

$$
(P_1)\quad \mathop{\mathrm{minimize}}_{w\in\mathbb{R}^n,\, z\in\mathbb{R}^m}\ \sum_{i=1}^m (1 - z_i)_+ + \lambda \sum_{g\in G} \|w_g\|_2,\qquad
\text{subject to}\ \ y^{(i)} x^{(i)\top} w = z_i \quad (i = 1, \ldots, m).
$$

Note that the way we introduce equality constraints is not unique. For example, we could have $(1 - y_i z_i)_+$ in the objective subject to $x^{(i)\top} w = z_i$. Nevertheless, as long as the mapping is one-to-one, this choice is not important. The current choice is made to mimic the most common representation of the SVM dual.

Now we are ready to form a Lagrangian $L(w, z, \alpha)$, where $\alpha = (\alpha_i)_{i=1}^m$ are the Lagrangian multipliers associated with the $m$ equality constraints in $(P_1)$. The Lagrangian can be written as follows:

$$
L(w, z, \alpha) = \sum_{i=1}^m (1 - z_i)_+ + \lambda \sum_{g\in G} \|w_g\|_2 + \sum_{i=1}^m \alpha_i \bigl(z_i - y^{(i)} x^{(i)\top} w\bigr).
$$

The dual function $d(\alpha)$ is obtained by minimizing the Lagrangian $L(w, z, \alpha)$ with respect to the primal variables $w$ and $z$ as follows:

$$
\begin{aligned}
d(\alpha) &= \inf_{w,z}\left(\sum_{i=1}^m (1 - z_i)_+ + \lambda \sum_{g\in G} \|w_g\|_2 + \sum_{i=1}^m \alpha_i \bigl(z_i - y^{(i)} x^{(i)\top} w\bigr)\right)\\
&= \sum_{i=1}^m \inf_{z_i}\bigl(\max(0,\, 1 - z_i) + \alpha_i z_i\bigr) + \sum_{g\in G} \inf_{w_g\in\mathbb{R}^{|g|}}\left(\lambda \|w_g\|_2 - \sum_{i=1}^m \alpha_i y^{(i)} x_g^{(i)\top} w_g\right)\\
&= \sum_{i=1}^m \inf_{z_i} \max\bigl(\alpha_i z_i,\ (\alpha_i - 1) z_i + 1\bigr) + \sum_{g\in G} \inf_{w_g\in\mathbb{R}^{|g|}}\left(\lambda \|w_g\|_2 - \sum_{i=1}^m \alpha_i y^{(i)} x_g^{(i)\top} w_g\right),
\end{aligned}
$$

which takes the finite value $\sum_{i=1}^m \alpha_i$ if the following conditions are satisfied:

$$
\alpha_i \geq 0,\qquad \alpha_i - 1 \leq 0,\qquad \left\|\sum_{i=1}^m \alpha_i y^{(i)} x_g^{(i)}\right\|_2 \leq \lambda.
$$

Otherwise $d(\alpha) = -\infty$ (a trivial lower bound).

Accordingly, we obtain the following dual problem:

$$
\begin{aligned}
(D_1)\quad &\mathop{\mathrm{maximize}}_{\alpha\in\mathbb{R}^m}\ \sum_{i=1}^m \alpha_i,\\
&\text{subject to}\ \ 0 \leq \alpha_i \leq 1 \quad (i = 1, \ldots, m),\qquad \left\|\sum_{i=1}^m \alpha_i y^{(i)} x_g^{(i)}\right\|_2 \leq \lambda \quad (g\in G).
$$
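The derivation can be sanity-checked numerically: by weak duality, any $\alpha$ feasible for $(D_1)$ gives a value $\sum_i \alpha_i$ no larger than the primal objective of (P) at any $w$. A small sketch under assumed toy data; the scaling trick used to make $\alpha$ feasible is just one convenient choice.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 20, 6, 1.0
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]  # disjoint partition of {0,...,5}
X = rng.normal(size=(m, n))
y = rng.choice([-1.0, 1.0], size=m)

def primal(w):
    # sum_i (1 - y_i x_i^T w)_+ + lam * sum_g ||w_g||_2
    return (np.maximum(0.0, 1.0 - y * (X @ w)).sum()
            + lam * sum(np.linalg.norm(w[g]) for g in groups))

# Build a feasible alpha: start in [0, 1]^m, then shrink until
# ||sum_i alpha_i y_i x_g^(i)||_2 <= lam for every group g.
alpha = rng.uniform(0.0, 1.0, size=m)
v = X.T @ (alpha * y)                    # v_g = sum_i alpha_i y_i x_g^(i)
scale = min(1.0, *(lam / np.linalg.norm(v[g]) for g in groups))
alpha *= scale
dual_value = alpha.sum()

# Weak duality: the dual value lower-bounds the primal at every w.
for _ in range(100):
    assert dual_value <= primal(rng.normal(size=n)) + 1e-9
```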


## 2 Using conic constraints

The second strategy to derive a dual problem is based on finding a conic structure in the primal problem. A cone $K$ is a subset of some vector space such that if $x\in K$, then for any nonnegative $\alpha$ we have $\alpha x \in K$.

The most common cone we encounter is the positive orthant cone; i.e., $K = \{x\in\mathbb{R}^n : x \geq 0\}$.

Another commonly used cone is the second-order cone; i.e., $K = \{(x_0, x^\top)^\top \in\mathbb{R}^{n+1} : x_0 \geq \|x\|_2\}$.

The dual cone $K^*$ of a cone $K$ is defined as follows:

$$
K^* = \{y\in\mathbb{R}^n : y^\top x \geq 0\ \ (\forall x\in K)\}.
$$

In other words, the dual cone is the collection of vectors that have nonnegative inner products with all the vectors in $K$. Note that both the positive orthant cone and the second-order cone are self-dual; i.e., $K^* = K$.
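Self-duality of the second-order cone can be probed numerically: every pair of points drawn from the cone should have a nonnegative inner product, and a point violating $x_0 \geq \|x\|_2$ admits a cone point pairing negatively with it. A small sampling sketch (evidence, not a proof):

```python
import numpy as np

def in_soc(v):
    """Membership in the second-order cone: v_0 >= ||v_1:||_2."""
    return v[0] >= np.linalg.norm(v[1:])

rng = np.random.default_rng(1)

def sample_soc(dim):
    """Sample a cone point by construction: v_0 = ||tail||_2 + slack."""
    tail = rng.normal(size=dim)
    return np.concatenate(([np.linalg.norm(tail) + rng.uniform(0, 1)], tail))

# K subset of K*: inner products between cone points are nonnegative.
for _ in range(200):
    u, v = sample_soc(4), sample_soc(4)
    assert u @ v >= -1e-12

# A point outside K is also outside K*: y = (0, e_1, 0, ...) pairs
# negatively with the cone point x = (1, -e_1, 0, ...).
y = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
x = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
assert in_soc(x) and not in_soc(y) and y @ x < 0
```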

Why is a cone useful? Because, when we minimize a Lagrangian and see a term like $f(\alpha)^\top x$ with $x\in K$, we know that the minimum is zero if $f(\alpha)\in K^*$ and $-\infty$ otherwise: if $f(\alpha)\notin K^*$ we can find a vector $x\in K$ such that $f(\alpha)^\top x < 0$, and even if $f(\alpha)^\top x$ is only slightly negative, we can take an arbitrarily large scalar $t > 0$ and drive $f(\alpha)^\top (t x)$ to $-\infty$.
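The scaling argument is easy to see for the positive orthant: if the vector multiplying $x$ has a negative coordinate, the corresponding coordinate direction lies in $K$ and scaling it drives the inner product down without bound; otherwise the infimum over $K$ is zero, attained at $x = 0$. A sketch with an assumed fixed vector standing in for $f(\alpha)$:

```python
import numpy as np

def inf_over_orthant(f_alpha, t_max=1e6):
    """inf over x >= 0 of f_alpha^T x: 0 if f_alpha >= 0, else unbounded below.

    We only *probe* the infimum by scaling a violating coordinate
    direction up to t_max; analytically the value is 0 or -inf.
    """
    if np.all(f_alpha >= 0):          # f_alpha in K* (= K for the orthant)
        return 0.0
    i = int(np.argmin(f_alpha))       # coordinate with f_alpha[i] < 0
    x = np.zeros_like(f_alpha)
    x[i] = 1.0                        # x in K with f_alpha^T x < 0
    return f_alpha @ (t_max * x)      # -> -inf as t_max -> inf

print(inf_over_orthant(np.array([1.0, 2.0])))    # 0.0
print(inf_over_orthant(np.array([1.0, -0.5])))   # -500000.0
```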

Let us consider a conic programming problem

$$
(P_C)\quad \mathop{\mathrm{minimize}}_{x\in\mathbb{R}^n}\ c^\top x,\qquad \text{subject to}\ \ Ax = b,\ x\in K,
$$

where $K$ is a cone. The dual problem of $(P_C)$ can be written as follows:

$$
(D_C)\quad \mathop{\mathrm{maximize}}_{\alpha\in\mathbb{R}^m}\ b^\top \alpha,\qquad \text{subject to}\ \ c - A^\top\alpha \in K^*,
$$

where $K^*$ is the dual cone of $K$. The derivation of $(D_C)$ (and some generalization) is given in Appendix A.
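For the positive orthant $K = K^* = \{x \geq 0\}$, $(P_C)$ and $(D_C)$ are the standard-form linear program and its dual, so their matching optimal values can be checked with `scipy.optimize.linprog`. A sketch assuming SciPy is available; the toy data are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
m, n = 3, 6
A = rng.normal(size=(m, n))
x_feas = rng.uniform(0.5, 1.5, size=n)   # makes Ax = b feasible with x >= 0
b = A @ x_feas
c = rng.uniform(1.0, 2.0, size=n)        # positive costs keep the LP bounded

# (P_C): minimize c^T x subject to Ax = b, x >= 0.
primal = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))

# (D_C): maximize b^T alpha subject to c - A^T alpha >= 0
# (linprog minimizes, so negate the objective).
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(None, None))

assert primal.status == 0 and dual.status == 0
print(primal.fun, -dual.fun)             # equal up to solver tolerance
assert abs(primal.fun - (-dual.fun)) < 1e-6
```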

Now we rewrite the primal problem (P) as a conic programming problem as follows:

$$
\begin{aligned}
(P_2)\quad &\mathop{\mathrm{minimize}}_{w\in\mathbb{R}^n,\ \xi\in\mathbb{R}^m,\ \tilde\xi\in\mathbb{R}^m,\ u_g\in\mathbb{R}\ (g\in G)}\ \sum_{i=1}^m \xi_i + \lambda \sum_{g\in G} u_g,\\
&\text{subject to}\ \ y^{(i)} x^{(i)\top} w + \xi_i - \tilde\xi_i = 1,\quad \xi_i,\, \tilde\xi_i \geq 0\ \ (i = 1, \ldots, m),\qquad \|w_g\|_2 \leq u_g\ \ (g\in G).
\end{aligned}
$$


By defining

$$
x = (\xi^\top, \tilde\xi^\top, u^\top, w^\top)^\top,\qquad
c = (\mathbf{1}_m^\top, \mathbf{0}_m^\top, \lambda \mathbf{1}_{|G|}^\top, \mathbf{0}_n^\top)^\top,
$$

$$
A = \Bigl(\, I_m \quad -I_m \quad 0_{m\times|G|} \quad M \,\Bigr),\qquad
M = \begin{pmatrix} y^{(1)} x^{(1)\top}\\ \vdots\\ y^{(m)} x^{(m)\top} \end{pmatrix},\qquad
b = \mathbf{1}_m,
$$

we notice that $(P_2)$ is a conic programming problem. In fact, the cone $K$ can be written as

$$
K = \left\{(\xi^\top, \tilde\xi^\top, u^\top, w^\top)^\top \in\mathbb{R}^{2m+n+|G|} : \xi \geq 0,\ \tilde\xi \geq 0,\ u_g \geq \|w_g\|_2\ (\forall g\in G)\right\}.
$$

Note that $K$ is self-dual; i.e., $K^* = K$. Accordingly the dual of $(P_2)$ can be written as follows:

$$
\begin{aligned}
(D_2)\quad &\mathop{\mathrm{maximize}}_{\alpha\in\mathbb{R}^m}\ \sum_{i=1}^m \alpha_i,\\
&\text{subject to}\ \ \mathbf{1}_m - \alpha \geq \mathbf{0}_m,\qquad \mathbf{0}_m + \alpha \geq \mathbf{0}_m,\qquad \lambda \geq \left\|\sum_{i=1}^m \alpha_i y^{(i)} x_g^{(i)}\right\|_2\ \ (\forall g\in G).
\end{aligned}
$$

The dual problem $(D_2)$ is clearly equivalent to $(D_1)$.

## 3 Using Fenchel's duality

Fenchel's duality theorem [Rockafellar, 1970, Theorem 31.1] states that for two proper closed convex functions $f$ and $g$, we have

$$
\inf_{x\in\mathbb{R}^n} \bigl(f(Ax) + g(x)\bigr) = \sup_{\alpha\in\mathbb{R}^m} \bigl(-f^*(-\alpha) - g^*(A^\top\alpha)\bigr), \tag{1}
$$

where $f^*$ and $g^*$ are the convex conjugate functions of $f$ and $g$, respectively. The derivation of Eq. (1) is given in Appendix B.

The problem (P) can be rewritten as follows:

$$
(P_3)\quad \mathop{\mathrm{minimize}}_{w\in\mathbb{R}^n}\ f(Aw) + g(w),
$$

(The convex conjugate function $f^*$ of a function $f$ is defined as $f^*(y) = \sup_x \bigl(y^\top x - f(x)\bigr)$. If $f$ is a proper closed convex function, then $f^{**} = f$.)


where

$$
f(z) = \sum_{i=1}^m (1 - z_i)_+,\qquad
g(w) = \lambda \sum_{g\in G} \|w_g\|_2,\qquad
A = \begin{pmatrix} y^{(1)} x^{(1)\top}\\ \vdots\\ y^{(m)} x^{(m)\top} \end{pmatrix}.
$$

Using Fenchel's duality theorem, the dual problem of $(P_3)$ can be written as follows:

$$
(D_3')\quad \mathop{\mathrm{maximize}}_{\alpha\in\mathbb{R}^m}\ -f^*(-\alpha) - g^*(A^\top\alpha).
$$

The remaining task, therefore, is to compute the convex conjugate functions $f^*$ and $g^*$.

First we compute $f^*$. By definition,

$$
\begin{aligned}
f^*(-\alpha) &= \sup_{z\in\mathbb{R}^m}\left(-z^\top\alpha - \sum_{i=1}^m \max(0,\, 1 - z_i)\right)\\
&= \sum_{i=1}^m \sup_{z_i} \min\bigl(-\alpha_i z_i,\ (1 - \alpha_i) z_i - 1\bigr)\\
&= \sum_{i=1}^m \begin{cases} +\infty & (\text{if } \alpha_i < 0),\\ -\alpha_i & (\text{if } 0 \leq \alpha_i \leq 1),\\ +\infty & (\text{if } \alpha_i > 1). \end{cases}
\end{aligned}
$$
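This closed form can be checked on a one-dimensional grid: for $0 \leq \alpha_i \leq 1$ the supremum of $-\alpha_i z - \max(0, 1-z)$ is attained at $z = 1$ with value $-\alpha_i$, while outside $[0, 1]$ the objective grows without bound along the grid. A sketch; the grid radius is arbitrary.

```python
import numpy as np

def conj_hinge_neg(alpha_i, z_grid):
    """Grid approximation of sup_z ( -alpha_i * z - max(0, 1 - z) )."""
    return np.max(-alpha_i * z_grid - np.maximum(0.0, 1.0 - z_grid))

z = np.arange(-100, 101, dtype=float)    # integer grid containing z = 1 exactly

# Inside [0, 1]: the supremum is -alpha_i, attained at z = 1.
for a in [0.0, 0.25, 0.5, 1.0]:
    assert abs(conj_hinge_neg(a, z) - (-a)) < 1e-12

# Outside [0, 1]: the value blows up with the grid radius (sup = +inf).
assert conj_hinge_neg(-0.5, z) > 40.0    # grows along z -> +inf
assert conj_hinge_neg(1.5, z) > 40.0     # grows along z -> -inf
```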

Next, we compute $g^*$. First we show a lower bound of $g^*$ as follows:

$$
\begin{aligned}
g^*(y) &= \sup_{w\in\mathbb{R}^n}\left(y^\top w - \lambda \sum_{g\in G} \|w_g\|_2\right)
= \sum_{g\in G} \sup_{w_g\in\mathbb{R}^{|g|}}\bigl(y_g^\top w_g - \lambda\|w_g\|_2\bigr)\\
&= \sum_{g\in G} \sup_{t\geq 0}\ \sup_{w_g : \|w_g\|_2 \leq t}\bigl(y_g^\top w_g - \lambda\|w_g\|_2\bigr)
\geq \sum_{g\in G} \sup_{t\geq 0}\bigl(\|y_g\|_2 - \lambda\bigr)t\\
&= \sum_{g\in G} \begin{cases} 0 & (\text{if } \|y_g\|_2 \leq \lambda),\\ +\infty & (\text{otherwise}). \end{cases}
\end{aligned}
$$
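Restricting to the ray $w_g = t\, y_g / \|y_g\|_2$ that achieves the bound, the inner objective equals $(\|y_g\|_2 - \lambda)t$: for $\|y_g\|_2 \leq \lambda$ it never exceeds $0$, and for $\|y_g\|_2 > \lambda$ it grows linearly in $t$. A quick numeric sketch for a single group (the vectors and grid are arbitrary):

```python
import numpy as np

def inner(y_g, w_g, lam):
    """y_g^T w_g - lam * ||w_g||_2, the quantity maximized in g*."""
    return y_g @ w_g - lam * np.linalg.norm(w_g)

lam = 1.0
t_grid = np.linspace(0.0, 100.0, 1001)

# ||y_g||_2 = 0.6 <= lam: along the ray the value is -0.4*t, sup = 0 at t = 0.
y_small = np.array([0.6, 0.0])
vals = [inner(y_small, t * y_small / np.linalg.norm(y_small), lam) for t in t_grid]
assert max(vals) <= 1e-12 and abs(vals[0]) < 1e-12

# ||y_g||_2 = 2 > lam: the value grows like (2 - 1)*t, so g*(y) = +inf.
y_big = np.array([2.0, 0.0])
vals = [inner(y_big, t * y_big / np.linalg.norm(y_big), lam) for t in t_grid]
assert vals[-1] > 90.0
```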


Figure 1: The shapes of the convex conjugate functions $f^*(-\alpha)$ and $g^*(y)$ in 1D. (a) The conjugate hinge loss, finite for $0 \leq \alpha_i \leq 1$. (b) The conjugate regularizer, finite for $\|y_g\|_2 \leq \lambda$.

Next we show that the above lower bound is tight. In fact, if $\|y_g\|_2 \leq \lambda$, we have $y_g^\top w_g \leq \lambda \|w_g\|_2$ (Cauchy–Schwarz inequality), which implies $y_g^\top w_g - \lambda\|w_g\|_2 \leq 0$.

Finally, substituting the above $f^*$ and $g^*$ into $(D_3')$, we obtain the following dual problem:

$$
(D_3)\quad \text{maximize}\ \ \begin{cases} \displaystyle\sum_{i=1}^m \alpha_i & \text{if } 0 \leq \alpha_i \leq 1\ (i = 1, \ldots, m)\ \text{and}\ \left\|\sum_{i=1}^m \alpha_i y^{(i)} x_g^{(i)}\right\|_2 \leq \lambda\ (\forall g\in G),\\ -\infty & \text{otherwise}. \end{cases}
$$

Note that the dual problem $(D_3)$ with the above $f^*$ and $g^*$ is equivalent to both $(D_1)$ and $(D_2)$.

Figure 1 shows the rough shapes of the conjugate functions $f^*(-\alpha)$ and $g^*(y)$.

## References

Stephen Boyd and Lieven Vandenberghe. *Convex Optimization*. Cambridge University Press, 2004.

R. Tyrrell Rockafellar. *Convex Analysis*. Princeton University Press, 1970.

## A Derivation of $(D_C)$ and generalization to arbitrary loss functions

Similarly to the derivation in Sec. 1, the Lagrangian of $(P_C)$ is written as follows:

$$
L(x, \alpha) = c^\top x + \alpha^\top (b - Ax).
$$


The dual function $d(\alpha)$ is obtained by minimizing the Lagrangian $L(x, \alpha)$ with respect to $x$ as follows:

$$
d(\alpha) = \inf_{x\in K}\bigl(c^\top x + (b - Ax)^\top\alpha\bigr)
= b^\top\alpha + \inf_{x\in K}\,(c - A^\top\alpha)^\top x.
$$

Note that the minimization with respect to $x$ is constrained to the cone $K$. Thus, for the infimum to be finite, it is necessary and sufficient that $c - A^\top\alpha \in K^*$ (recall, by definition, that for any $x\in K$ and $y\in K^*$, $x^\top y \geq 0$, and if $y\notin K^*$ there exists $x\in K$ such that $x^\top y < 0$). If $c - A^\top\alpha \in K^*$, then the minimum is zero, attained at $x = 0$, because for any $x\in K$ and $y\in K^*$, $x^\top y \geq 0$. Accordingly, we obtain the dual problem $(D_C)$.

The above conic duality can be generalized to an arbitrary convex loss function $f(x)$ instead of $c^\top x$. Let us consider the following primal problem:

$$
(P_C')\quad \mathop{\mathrm{minimize}}_{x\in\mathbb{R}^n}\ f(x),\qquad \text{subject to}\ \ Ax = b,\ x\in K,
$$

where $K$ is a cone.

Introducing an auxiliary variable $z\in\mathbb{R}^n$, we can rewrite $(P_C')$ as follows:

$$
(P_C'')\quad \mathop{\mathrm{minimize}}_{x,\, z\in\mathbb{R}^n}\ f(z),\qquad \text{subject to}\ \ Ax = b,\ z = x,\ x\in K.
$$

The Lagrangian $L(x, z, \alpha, \beta)$ of $(P_C'')$ can be written as follows:

$$
L(x, z, \alpha, \beta) = f(z) + \alpha^\top(b - Ax) + \beta^\top(x - z),
$$

where $\alpha\in\mathbb{R}^m$ and $\beta\in\mathbb{R}^n$ are Lagrangian multipliers.

The dual function $d(\alpha, \beta)$ can be obtained by minimizing $L(x, z, \alpha, \beta)$ with respect to $x$ and $z$ as follows:

$$
\begin{aligned}
d(\alpha, \beta) &= \inf_{x\in K,\, z}\bigl(f(z) + \alpha^\top(b - Ax) + \beta^\top(x - z)\bigr)\\
&= b^\top\alpha + \inf_{z\in\mathbb{R}^n}\bigl(f(z) - \beta^\top z\bigr) + \inf_{x\in K}\,(\beta - A^\top\alpha)^\top x\\
&= b^\top\alpha - \sup_{z\in\mathbb{R}^n}\bigl(\beta^\top z - f(z)\bigr) + \inf_{x\in K}\,(\beta - A^\top\alpha)^\top x\\
&= b^\top\alpha - f^*(\beta) + \inf_{x\in K}\,(\beta - A^\top\alpha)^\top x,
\end{aligned}
$$

where $f^*$ is the convex conjugate of $f$. Note that the minimization with respect to $x$ takes the finite value zero if and only if $\beta - A^\top\alpha \in K^*$ (otherwise $d(\alpha, \beta) = -\infty$).


Accordingly, the dual problem is written as follows:

$$
(D_C'')\quad \mathop{\mathrm{maximize}}_{\alpha\in\mathbb{R}^m,\ \beta\in\mathbb{R}^n}\ b^\top\alpha - f^*(\beta),\qquad \text{subject to}\ \ \beta - A^\top\alpha \in K^*.
$$

Note that if $f(x) = c^\top x$, then $f^*(\beta) = 0$ if $\beta = c$, and $f^*(\beta) = +\infty$ otherwise. Therefore, $(D_C'')$ reduces to $(D_C)$.

## B Derivation of Fenchel's duality theorem

First, we introduce an equality constraint and rewrite the left-hand side of Eq. (1) as follows:

$$
(P_F)\quad \mathop{\mathrm{minimize}}_{x\in\mathbb{R}^n,\ z\in\mathbb{R}^m}\ f(z) + g(x),\qquad \text{subject to}\ \ Ax = z.
$$

The Lagrangian $L(x, z, \alpha)$ of the equality-constrained problem $(P_F)$ can be written as follows:

$$
L(x, z, \alpha) = f(z) + g(x) + \alpha^\top(z - Ax).
$$

Minimizing the Lagrangian $L(x, z, \alpha)$ with respect to $x$ and $z$, we obtain the dual function $d(\alpha)$ as follows:

$$
\begin{aligned}
d(\alpha) &= \inf_{x,z}\bigl(f(z) + g(x) + \alpha^\top(z - Ax)\bigr)\\
&= \inf_z\bigl(f(z) + \alpha^\top z\bigr) + \inf_x\bigl(g(x) - \alpha^\top Ax\bigr)\\
&= -\sup_z\bigl((-\alpha)^\top z - f(z)\bigr) - \sup_x\bigl((A^\top\alpha)^\top x - g(x)\bigr)\\
&= -f^*(-\alpha) - g^*(A^\top\alpha).
\end{aligned}
$$

If both $f$ and $g$ are convex, $(P_F)$ satisfies Slater's condition [Boyd and Vandenberghe, 2004] and strong duality holds. Therefore we have Eq. (1).
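Eq. (1) can be verified in closed form for a quadratic pair, say $f(z) = \tfrac{1}{2}\|z - b\|^2$ and $g(x) = \tfrac{1}{2}\|x\|^2$, whose conjugates are $f^*(y) = \tfrac{1}{2}\|y\|^2 + b^\top y$ and $g^*(y) = \tfrac{1}{2}\|y\|^2$. Both sides of Eq. (1) then have closed-form optimizers, which a short script can compare (the data are arbitrary; this pair is our choice of example, not from the note):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 7
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

# Primal: inf_x 0.5*||Ax - b||^2 + 0.5*||x||^2,
# minimized at x* = (A^T A + I)^{-1} A^T b.
x_star = np.linalg.solve(A.T @ A + np.eye(n), A.T @ b)
primal = 0.5 * np.sum((A @ x_star - b) ** 2) + 0.5 * np.sum(x_star ** 2)

# Dual: sup_a -f*(-a) - g*(A^T a) = sup_a b^T a - 0.5*||a||^2 - 0.5*||A^T a||^2,
# maximized at a* = (I + A A^T)^{-1} b.
a_star = np.linalg.solve(np.eye(m) + A @ A.T, b)
dual = b @ a_star - 0.5 * np.sum(a_star ** 2) - 0.5 * np.sum((A.T @ a_star) ** 2)

print(primal, dual)
assert abs(primal - dual) < 1e-9   # strong duality: the two sides of Eq. (1) agree
```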

Fenchel's duality theorem can be generalized to the following problem:

$$
(P_F')\quad \mathop{\mathrm{minimize}}_{x}\ f(Ax) + g(Bx),
$$

whose dual problem can be written as

$$
(D_F')\quad \mathop{\mathrm{maximize}}_{\alpha,\, \beta}\ -f^*(-\alpha) - g^*(\beta),\qquad \text{subject to}\ \ A^\top\alpha = B^\top\beta.
$$

If $B = I_n$ (the identity matrix), the above duality is equivalent to Fenchel's duality theorem.
