Three strategies to derive a dual problem

Ryota Tomioka
May 18, 2010
There are three strategies for deriving a dual problem: (i) equality constraints, (ii) conic constraints, and (iii) Fenchel's duality. Using a group-lasso regularized support vector machine (= multiple kernel learning, MKL) problem as an example, we see how these strategies can be used to derive dual problems that look different but are actually equivalent.
More specifically, we are interested in a dual of the following problem:
$$(P)\qquad \mathop{\mathrm{minimize}}_{w\in\mathbb{R}^n}\quad \sum_{i=1}^{m}\ell_H\bigl(x^{(i)\top}w,\ y^{(i)}\bigr) + \lambda\sum_{g\in\mathcal{G}}\lVert w_g\rVert_2,$$
where $\{x^{(i)}, y^{(i)}\}_{i=1}^{m}$ ($x^{(i)}\in\mathbb{R}^n$, $y^{(i)}\in\{-1,+1\}$) are training examples, $\ell_H(z_i, y_i) := (1 - y_i z_i)_+$ is the hinge loss function, and $\mathcal{G}$ is a disjoint partition of $\{1,2,\ldots,n\}$; i.e., $\bigcup_{g\in\mathcal{G}} g = \{1,2,\ldots,n\}$, and $g_1, g_2\in\mathcal{G}$ with $g_1\neq g_2$ imply $g_1\cap g_2 = \emptyset$. For $g\in\mathcal{G}$, $w_g\in\mathbb{R}^{|g|}$ denotes the sub-vector of $w$ indexed by $g$ (and similarly $x^{(i)}_g$ for $x^{(i)}$).
1 Using equality constraints
The most basic technique in deriving a dual problem can be summarized as
follows:
1. Find an equality constraint.
2. If you cannot find an equality constraint, introduce auxiliary variables to
create an equality constraint.
3. Form a Lagrangian. Introduce a Lagrange multiplier for every equality
constraint.
4. Try to minimize the Lagrangian with respect to the primal variables.
5. If the minimization is too hard, introduce more auxiliary variables, and
go back to 3.
6. If you can minimize the Lagrangian, check when it takes a finite value and
when it becomes −∞. This will give you the dual constraints.
Following the above recipe, we first notice that there is no equality constraint in the above primal problem (P). Thus we introduce an auxiliary variable $z\in\mathbb{R}^m$ and rewrite (P) as follows:
$$(P_1)\qquad \mathop{\mathrm{minimize}}_{w\in\mathbb{R}^n,\ z\in\mathbb{R}^m}\quad \sum_{i=1}^{m}(1-z_i)_+ + \lambda\sum_{g\in\mathcal{G}}\lVert w_g\rVert_2,$$
$$\text{subject to}\quad y^{(i)}x^{(i)\top}w = z_i\quad (i = 1,\ldots,m).$$
Note that the way we introduce equality constraints is not unique. For example, we could have kept $(1 - y_i z_i)_+$ in the objective subject to $x^{(i)\top}w = z_i$. Nevertheless, as long as the mapping is one-to-one, this choice is not important. The current choice is made to mimic the most common representation of the SVM dual.
Now we are ready to form the Lagrangian $L(w, z, \alpha)$, where $\alpha = (\alpha_i)_{i=1}^m$ are the Lagrange multipliers associated with the $m$ equality constraints in $(P_1)$. The Lagrangian can be written as follows:
$$L(w, z, \alpha) = \sum_{i=1}^m (1-z_i)_+ + \lambda\sum_{g\in\mathcal{G}}\lVert w_g\rVert_2 + \sum_{i=1}^m \alpha_i\bigl(z_i - y^{(i)}x^{(i)\top}w\bigr).$$
The dual function $d(\alpha)$ is obtained by minimizing the Lagrangian $L(w, z, \alpha)$ with respect to the primal variables $w$ and $z$ as follows:
$$
\begin{aligned}
d(\alpha) &= \inf_{w,z}\left[\sum_{i=1}^m (1-z_i)_+ + \lambda\sum_{g\in\mathcal{G}}\lVert w_g\rVert_2 + \sum_{i=1}^m \alpha_i\bigl(z_i - y^{(i)}x^{(i)\top}w\bigr)\right]\\
&= \sum_{i=1}^m \inf_{z_i}\bigl(\max(0,\ 1-z_i) + \alpha_i z_i\bigr) + \sum_{g\in\mathcal{G}}\inf_{w_g\in\mathbb{R}^{|g|}}\left(\lambda\lVert w_g\rVert_2 - \Bigl(\sum_{i=1}^m \alpha_i y^{(i)}x^{(i)}_g\Bigr)^{\top} w_g\right)\\
&= \sum_{i=1}^m \inf_{z_i}\max\bigl(\alpha_i z_i,\ (\alpha_i - 1)z_i + 1\bigr) + \sum_{g\in\mathcal{G}}\inf_{w_g\in\mathbb{R}^{|g|}}\left(\lambda\lVert w_g\rVert_2 - \Bigl(\sum_{i=1}^m \alpha_i y^{(i)}x^{(i)}_g\Bigr)^{\top} w_g\right),
\end{aligned}
$$
which takes the finite value $\sum_{i=1}^m\alpha_i$ if the following conditions are satisfied:
$$\alpha_i\ge 0,\qquad \alpha_i - 1\le 0\quad (i = 1,\ldots,m),\qquad \Bigl\lVert\sum_{i=1}^m\alpha_i y^{(i)}x^{(i)}_g\Bigr\rVert_2\le\lambda\quad (g\in\mathcal{G}).$$
Otherwise $d(\alpha) = -\infty$ (a trivial lower bound).
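To make the last step explicit (a case analysis that is implicit above), the two one-dimensional infima evaluate as follows, writing $v_g := \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}_g$ as a shorthand:
$$\inf_{z_i}\max\bigl(\alpha_i z_i,\ (\alpha_i-1)z_i + 1\bigr) = \begin{cases}\alpha_i & \text{if } 0\le\alpha_i\le 1\ (\text{attained at } z_i = 1),\\ -\infty & \text{otherwise (one of the two slopes points downhill)},\end{cases}$$
$$\inf_{w_g}\bigl(\lambda\lVert w_g\rVert_2 - v_g^\top w_g\bigr) = \begin{cases}0 & \text{if } \lVert v_g\rVert_2\le\lambda\ (\text{attained at } w_g = 0),\\ -\infty & \text{otherwise (take } w_g = t\,v_g/\lVert v_g\rVert_2,\ t\to\infty).\end{cases}$$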
Accordingly, we obtain the following dual problem:
$$(D_1)\qquad \mathop{\mathrm{maximize}}_{\alpha\in\mathbb{R}^m}\quad \sum_{i=1}^m\alpha_i,$$
$$\text{subject to}\quad 0\le\alpha_i\le 1\quad (i = 1,\ldots,m),\qquad \Bigl\lVert\sum_{i=1}^m\alpha_i y^{(i)}x^{(i)}_g\Bigr\rVert_2\le\lambda\quad (g\in\mathcal{G}).$$
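As a quick numerical sanity check (our addition, not part of the original derivation), one can solve (P) and $(D_1)$ on small random data and compare the optimal values; by strong duality they should coincide. The sketch below assumes the cvxpy modeling package; the data, the group structure, and $\lambda$ are arbitrary.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n, lam = 40, 6, 1.0
groups = [[0, 1, 2], [3, 4, 5]]      # a disjoint partition of {0, ..., n-1}
X = rng.standard_normal((m, n))      # row i is the training input x^(i)
y = np.where(rng.standard_normal(m) > 0, 1.0, -1.0)  # labels in {-1, +1}

# Primal (P): hinge loss plus group-lasso regularizer.
w = cp.Variable(n)
hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ w)))
group_lasso = sum(cp.norm(w[g], 2) for g in groups)
p_star = cp.Problem(cp.Minimize(hinge + lam * group_lasso)).solve()

# Dual (D1): box constraints plus one group-norm constraint per group.
alpha = cp.Variable(m)
cons = [alpha >= 0, alpha <= 1]
cons += [cp.norm(X[:, g].T @ cp.multiply(y, alpha), 2) <= lam for g in groups]
d_star = cp.Problem(cp.Maximize(cp.sum(alpha)), cons).solve()

print(p_star, d_star)  # the two optimal values should agree
```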
2 Using conic constraints
The second strategy to derive a dual problem is based on finding a conic structure in the primal problem. A cone $K$ is a subset of some vector space such that if $x\in K$, then for any nonnegative $\alpha$ we have $\alpha x\in K$.
The most common cone we encounter is the positive orthant cone; i.e., $K = \{x\in\mathbb{R}^n : x\ge 0\}$.
Another commonly used cone is the second-order cone; i.e.,
$$K = \{(x_0, x^\top)^\top\in\mathbb{R}^{n+1} : x_0\ge\lVert x\rVert_2\}.$$
The dual cone $K^*$ of a cone $K$ is defined as follows:
$$K^* = \{y\in\mathbb{R}^n : y^\top x\ge 0\ (\forall x\in K)\}.$$
In other words, the dual cone is the collection of vectors that have nonnegative inner products with all the vectors in $K$. Note that both the positive orthant cone and the second-order cone are self-dual; i.e., $K^* = K$.
Why is a cone useful? Because, when we consider the minimization of a Lagrangian and see a term like $f(\alpha)^\top x$ with $x\in K$, we know that the minimum is zero if $f(\alpha)\in K^*$ and $-\infty$ otherwise (because if $f(\alpha)\notin K^*$ we can find a vector $x\in K$ such that $f(\alpha)^\top x < 0$, and even if $f(\alpha)^\top x$ is only slightly below zero, we can scale $x$ by a large $t > 0$ and drive $f(\alpha)^\top(tx)$ to $-\infty$).
Let us consider a conic programming problem:
$$(P_C)\qquad \mathop{\mathrm{minimize}}_{x\in\mathbb{R}^n}\quad c^\top x,\qquad \text{subject to}\quad Ax = b,\ x\in K,$$
where $K$ is a cone. The dual problem of $(P_C)$ can be written as follows:
$$(D_C)\qquad \mathop{\mathrm{maximize}}_{\alpha\in\mathbb{R}^m}\quad b^\top\alpha,\qquad \text{subject to}\quad c - A^\top\alpha\in K^*,$$
where $K^*$ is the dual cone of $K$. The derivation of $(D_C)$ (and some generalization) is given in Appendix A.
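For concreteness (an example we add here, not in the original note), taking $K$ to be the positive orthant cone, which is self-dual, recovers the familiar linear programming duality:
$$\mathop{\mathrm{minimize}}_{x\ge 0,\ Ax = b}\ c^\top x\qquad\longleftrightarrow\qquad \mathop{\mathrm{maximize}}_{\alpha}\ b^\top\alpha\quad\text{subject to}\quad c - A^\top\alpha\ge 0.$$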
Now we rewrite the primal problem (P) as a conic programming problem as follows:
$$(P_2)\qquad \mathop{\mathrm{minimize}}_{w\in\mathbb{R}^n,\ \xi\in\mathbb{R}^m,\ \tilde{\xi}\in\mathbb{R}^m,\ u_g\in\mathbb{R}\ (g\in\mathcal{G})}\quad \sum_{i=1}^m\xi_i + \lambda\sum_{g\in\mathcal{G}}u_g,$$
$$\text{subject to}\quad y^{(i)}x^{(i)\top}w + \xi_i - \tilde{\xi}_i = 1,\quad \xi_i,\ \tilde{\xi}_i\ge 0\quad (i = 1,\ldots,m),\qquad \lVert w_g\rVert_2\le u_g\quad (g\in\mathcal{G}).$$
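To see that $(P_2)$ is equivalent to (P) (a step the note leaves implicit), note that at the optimum the auxiliary variables represent the hinge loss and the group norms exactly:
$$(1 - y^{(i)}x^{(i)\top}w)_+ = \min\bigl\{\xi_i : y^{(i)}x^{(i)\top}w + \xi_i - \tilde{\xi}_i = 1,\ \xi_i\ge 0,\ \tilde{\xi}_i\ge 0\bigr\},\qquad \lVert w_g\rVert_2 = \min\bigl\{u_g : u_g\ge\lVert w_g\rVert_2\bigr\}.$$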
By defining
$$x = (\xi^\top,\ \tilde{\xi}^\top,\ u^\top,\ w^\top)^\top,\qquad c = (\mathbf{1}_m^\top,\ \mathbf{0}_m^\top,\ \lambda\mathbf{1}_{|\mathcal{G}|}^\top,\ \mathbf{0}_n^\top)^\top,$$
$$A = \begin{pmatrix} I_m & -I_m & 0_{m\times|\mathcal{G}|} & \hat{X}\end{pmatrix},\qquad \hat{X} := \begin{pmatrix} y^{(1)}x^{(1)\top}\\ \vdots\\ y^{(m)}x^{(m)\top}\end{pmatrix},\qquad b = \mathbf{1}_m,$$
we notice that $(P_2)$ is a conic programming problem. In fact, the cone $K$ can be written as
$$K = \Bigl\{(\xi^\top,\ \tilde{\xi}^\top,\ u^\top,\ w^\top)^\top\in\mathbb{R}^{2m+n+|\mathcal{G}|} : \xi\ge 0,\ \tilde{\xi}\ge 0,\ u_g\ge\lVert w_g\rVert_2\ (\forall g\in\mathcal{G})\Bigr\}.$$
Note that $K$ is self-dual; i.e., $K^* = K$. Accordingly, the dual of $(P_2)$ can be written as follows:
$$(D_2)\qquad \mathop{\mathrm{maximize}}_{\alpha\in\mathbb{R}^m}\quad \sum_{i=1}^m\alpha_i,$$
$$\text{subject to}\quad \mathbf{1}_m - \alpha\ge\mathbf{0}_m,\qquad \mathbf{0}_m + \alpha\ge\mathbf{0}_m,\qquad \lambda\ge\Bigl\lVert\sum_{i=1}^m\alpha_i y^{(i)}x^{(i)}_g\Bigr\rVert_2\quad (\forall g\in\mathcal{G}).$$
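These constraints are exactly the conic dual condition $c - A^\top\alpha\in K^* = K$ of $(D_C)$, written block by block (using $\hat{X}$ defined above):
$$c - A^\top\alpha = \bigl((\mathbf{1}_m - \alpha)^\top,\ \alpha^\top,\ \lambda\mathbf{1}_{|\mathcal{G}|}^\top,\ -(\hat{X}^\top\alpha)^\top\bigr)^\top,$$
and membership in $K$ requires $\mathbf{1}_m - \alpha\ge 0$, $\alpha\ge 0$, and $\lambda\ge\lVert(\hat{X}^\top\alpha)_g\rVert_2 = \bigl\lVert\sum_{i=1}^m\alpha_i y^{(i)}x^{(i)}_g\bigr\rVert_2$ for every $g\in\mathcal{G}$; the sign of the last block is irrelevant because the norm is sign-invariant.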
The dual problem $(D_2)$ is clearly equivalent to $(D_1)$.
3 Using Fenchel’s duality
Fenchel's duality theorem [Rockafellar, 1970, Theorem 31.1] states that for two proper closed convex functions $f$ and $g$, we have
$$\inf_{x\in\mathbb{R}^n}\bigl(f(Ax) + g(x)\bigr) = \sup_{\alpha\in\mathbb{R}^m}\bigl(-f^*(-\alpha) - g^*(A^\top\alpha)\bigr),\qquad (1)$$
where $f^*$ and $g^*$ are the convex conjugate functions of $f$ and $g$, respectively.¹ The derivation of Eq. (1) is given in Appendix B.

¹The convex conjugate function $f^*$ of a function $f$ is defined as $f^*(y) = \sup_x\bigl(y^\top x - f(x)\bigr)$. If $f$ is a proper closed convex function, then $f^{**} = f$.
The problem (P) can be rewritten as follows:
$$(P_3)\qquad \mathop{\mathrm{minimize}}_{w\in\mathbb{R}^n}\quad f(Aw) + g(w),$$
where
$$f(z) = \sum_{i=1}^m (1-z_i)_+,\qquad g(w) = \lambda\sum_{g\in\mathcal{G}}\lVert w_g\rVert_2,\qquad A = \begin{pmatrix} y^{(1)}x^{(1)\top}\\ \vdots\\ y^{(m)}x^{(m)\top}\end{pmatrix},$$
so that $f(Aw) + g(w)$ is exactly the objective of (P).
Using Fenchel's duality theorem, the dual problem of $(P_3)$ can be written as follows:
$$(D_3')\qquad \mathop{\mathrm{maximize}}_{\alpha\in\mathbb{R}^m}\quad -f^*(-\alpha) - g^*(A^\top\alpha).$$
The remaining task, therefore, is to compute the convex conjugate functions $f^*$ and $g^*$.
First we compute $f^*$. By definition,
$$
\begin{aligned}
f^*(-\alpha) &= \sup_{z\in\mathbb{R}^m}\Bigl(-z^\top\alpha - \sum_{i=1}^m\max(0,\ 1-z_i)\Bigr)\\
&= \sum_{i=1}^m\sup_{z_i}\,\min\bigl(-\alpha_i z_i,\ (1-\alpha_i)z_i - 1\bigr)\\
&= \sum_{i=1}^m\begin{cases}+\infty & (\text{if } \alpha_i < 0),\\ -\alpha_i & (\text{if } 0\le\alpha_i\le 1),\\ +\infty & (\text{if } \alpha_i > 1).\end{cases}
\end{aligned}
$$
Next, we compute $g^*$. First we show a lower bound of $g^*$ as follows:
$$
\begin{aligned}
g^*(y) &= \sup_{w\in\mathbb{R}^n}\Bigl(y^\top w - \lambda\sum_{g\in\mathcal{G}}\lVert w_g\rVert_2\Bigr)\\
&= \sum_{g\in\mathcal{G}}\sup_{w_g\in\mathbb{R}^{|g|}}\bigl(y_g^\top w_g - \lambda\lVert w_g\rVert_2\bigr)\\
&= \sum_{g\in\mathcal{G}}\sup_{t\ge 0}\ \sup_{w_g:\lVert w_g\rVert_2\le t}\bigl(y_g^\top w_g - \lambda\lVert w_g\rVert_2\bigr)\\
&\ge \sum_{g\in\mathcal{G}}\sup_{t\ge 0}\,\bigl(\lVert y_g\rVert_2 - \lambda\bigr)t\\
&= \sum_{g\in\mathcal{G}}\begin{cases}0 & (\text{if } \lVert y_g\rVert_2\le\lambda),\\ +\infty & (\text{otherwise}),\end{cases}
\end{aligned}
$$
where the inequality follows by choosing $w_g = t\,y_g/\lVert y_g\rVert_2$ inside the inner supremum.
[Figure 1: The shapes of the convex conjugate functions $f^*(-\alpha)$ and $g^*(y)$ in 1D. (a) The conjugate hinge loss $f^*(-\alpha_i)$, finite only between $\alpha_i = 0$ and $\alpha_i = 1$. (b) The conjugate regularizer $g^*(y_g)$, zero on $\lVert y_g\rVert_2\le\lambda$.]
Next we show that the above lower bound is tight. In fact, if $\lVert y_g\rVert_2\le\lambda$, we have $y_g^\top w_g\le\lambda\lVert w_g\rVert_2$ (Cauchy-Schwarz inequality), which implies
$$\sup_{w_g}\bigl(y_g^\top w_g - \lambda\lVert w_g\rVert_2\bigr)\le 0,$$
with equality attained at $w_g = 0$.
Finally, substituting the above $f^*$ and $g^*$ into $(D_3')$, we obtain the following dual problem:
$$(D_3)\qquad \mathop{\mathrm{maximize}}_{\alpha\in\mathbb{R}^m}\quad \begin{cases}\displaystyle\sum_{i=1}^m\alpha_i & \text{if } 0\le\alpha_i\le 1\ (i = 1,\ldots,m)\ \text{and}\ \Bigl\lVert\sum_{i=1}^m\alpha_i y^{(i)}x^{(i)}_g\Bigr\rVert_2\le\lambda\ (\forall g\in\mathcal{G}),\\[1ex] -\infty & \text{otherwise}.\end{cases}$$
Note that the dual problem $(D_3)$ with the above $f^*$ and $g^*$ is equivalent to both $(D_1)$ and $(D_2)$.
Figure 1 shows the rough shapes of the conjugate functions $f^*(-\alpha)$ and $g^*(y)$.
References
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
A Derivation of $(D_C)$ and generalization to arbitrary loss functions
Similarly to the derivation in Sec. 1, the Lagrangian of $(P_C)$ is written as follows:
$$L(x, \alpha) = c^\top x + \alpha^\top(b - Ax).$$
The dual function $d(\alpha)$ is obtained by minimizing the Lagrangian $L(x, \alpha)$ with respect to $x$ as follows:
$$d(\alpha) = \inf_{x\in K}\bigl(c^\top x + (b - Ax)^\top\alpha\bigr) = b^\top\alpha + \inf_{x\in K}\,(c - A^\top\alpha)^\top x.$$
Note that the minimization with respect to $x$ is constrained to the cone $K$. Thus, for the minimum to be finite, it is necessary and sufficient that $c - A^\top\alpha\in K^*$ (recall that, by definition, $x^\top y\ge 0$ for any $x\in K$ and $y\in K^*$, and if $y\notin K^*$ there exists $x\in K$ such that $x^\top y < 0$). If $c - A^\top\alpha\in K^*$, then the minimum is zero, because $x^\top y\ge 0$ for any $x\in K$ and $y\in K^*$, and $x = \mathbf{0}\in K$ attains the value zero. Accordingly, we obtain the dual problem $(D_C)$.

The above conic duality can be generalized to an arbitrary convex loss function $f(x)$ in place of $c^\top x$.
Let us consider the following primal problem:
$$(P_C')\qquad \mathop{\mathrm{minimize}}_{x\in\mathbb{R}^n}\quad f(x),\qquad \text{subject to}\quad Ax = b,\ x\in K,$$
where $K$ is a cone.
Introducing an auxiliary variable $z\in\mathbb{R}^n$, we can rewrite $(P_C')$ as follows:
$$(P_C'')\qquad \mathop{\mathrm{minimize}}_{x,z\in\mathbb{R}^n}\quad f(z),\qquad \text{subject to}\quad Ax = b,\ z = x,\ x\in K.$$
The Lagrangian $L(x, z, \alpha, \beta)$ of $(P_C'')$ can be written as follows:
$$L(x, z, \alpha, \beta) = f(z) + \alpha^\top(b - Ax) + \beta^\top(x - z),$$
where $\alpha\in\mathbb{R}^m$ and $\beta\in\mathbb{R}^n$ are Lagrange multipliers.
The dual function $d(\alpha, \beta)$ is obtained by minimizing $L(x, z, \alpha, \beta)$ with respect to $x\in K$ and $z$ as follows:
$$
\begin{aligned}
d(\alpha, \beta) &= \inf_{x\in K,\ z}\bigl(f(z) + \alpha^\top(b - Ax) + \beta^\top(x - z)\bigr)\\
&= b^\top\alpha + \inf_{z\in\mathbb{R}^n}\bigl(f(z) - \beta^\top z\bigr) + \inf_{x\in K}\bigl(\beta - A^\top\alpha\bigr)^\top x\\
&= b^\top\alpha - \sup_{z\in\mathbb{R}^n}\bigl(\beta^\top z - f(z)\bigr) + \inf_{x\in K}\bigl(\beta - A^\top\alpha\bigr)^\top x\\
&= b^\top\alpha - f^*(\beta) + \inf_{x\in K}\bigl(\beta - A^\top\alpha\bigr)^\top x,
\end{aligned}
$$
where $f^*$ is the convex conjugate of $f$. Note that the minimization with respect to $x$ takes the finite value zero if and only if $\beta - A^\top\alpha\in K^*$ (otherwise $d(\alpha, \beta) = -\infty$).
Accordingly, the dual problem is written as follows:
$$(D_C'')\qquad \mathop{\mathrm{maximize}}_{\alpha\in\mathbb{R}^m,\ \beta\in\mathbb{R}^n}\quad b^\top\alpha - f^*(\beta),\qquad \text{subject to}\quad \beta - A^\top\alpha\in K^*.$$
Note that if $f(x) = c^\top x$, then $f^*(\beta) = 0$ if $\beta = c$ and $f^*(\beta) = +\infty$ otherwise. Therefore, $(D_C'')$ reduces to $(D_C)$.
B Derivation of Fenchel’s duality theorem
First, we introduce an equality constraint and rewrite the left-hand side of Eq. (1) as follows:
$$(P_F)\qquad \mathop{\mathrm{minimize}}_{x\in\mathbb{R}^n,\ z\in\mathbb{R}^m}\quad f(z) + g(x),\qquad \text{subject to}\quad Ax = z.$$
The Lagrangian $L(x, z, \alpha)$ of the equality-constrained problem $(P_F)$ can be written as follows:
$$L(x, z, \alpha) = f(z) + g(x) + \alpha^\top(z - Ax).$$
Minimizing the Lagrangian $L(x, z, \alpha)$ with respect to $x$ and $z$, we obtain the dual function $d(\alpha)$ as follows:
$$
\begin{aligned}
d(\alpha) &= \inf_{x,z}\bigl(f(z) + g(x) + \alpha^\top(z - Ax)\bigr)\\
&= \inf_z\bigl(f(z) + \alpha^\top z\bigr) + \inf_x\bigl(g(x) - \alpha^\top Ax\bigr)\\
&= -\sup_z\bigl((-\alpha)^\top z - f(z)\bigr) - \sup_x\bigl((A^\top\alpha)^\top x - g(x)\bigr)\\
&= -f^*(-\alpha) - g^*(A^\top\alpha).
\end{aligned}
$$
If both $f$ and $g$ are convex, $(P_F)$ satisfies Slater's condition [Boyd and Vandenberghe, 2004] and strong duality holds. Therefore we have Eq. (1).
Fenchel's duality theorem can be generalized to the following problem:
$$(P_F')\qquad \mathop{\mathrm{minimize}}_{x}\quad f(Ax) + g(Bx),$$
whose dual problem can be written as
$$(D_F')\qquad \mathop{\mathrm{maximize}}_{\alpha,\beta}\quad -f^*(-\alpha) - g^*(\beta),\qquad \text{subject to}\quad A^\top\alpha = B^\top\beta.$$
If $B = I_n$ (the identity matrix), the above duality is equivalent to Fenchel's duality theorem.
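A brief sketch of how $(D_F')$ arises (our addition, following the same recipe as in Appendix A): introduce auxiliary variables $z = Ax$ and $v = Bx$ with multipliers $\alpha$ and $-\beta$, respectively, and minimize the Lagrangian:
$$d(\alpha, \beta) = \inf_{x,z,v}\bigl(f(z) + g(v) + \alpha^\top(z - Ax) - \beta^\top(v - Bx)\bigr) = -f^*(-\alpha) - g^*(\beta) + \inf_{x}\bigl(B^\top\beta - A^\top\alpha\bigr)^\top x,$$
where the last infimum is zero if $A^\top\alpha = B^\top\beta$ and $-\infty$ otherwise.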