
Recursive Decomposition for Nonconvex Optimization

Supplementary Material
Abram L. Friesen and Pedro Domingos
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195, USA
{afriesen,pedrod}@cs.washington.edu

This supplement to the paper “Recursive Decomposition for Nonconvex Optimization,” published in IJCAI 2015, provides proofs for the results in the paper, additional implementation details on RDIS, and additional experimental details for the results shown in the paper. The experimental details include additional figures that did not fit in the space allotted to the main paper.

A Analysis Details

A.1 Complexity
RDIS begins by choosing a block of variables, x_C. Assuming that this choice is made heuristically using the PaToH library for hypergraph partitioning, which is a multi-level technique, the complexity of choosing variables is linear (see Trifunović [2006], p. 81). Within the loop, RDIS chooses values for x_C, simplifies and decomposes the function, and finally recurses. Let the complexity of choosing values using the subspace optimizer be g(d), where |x_C| = d, and let one call to the subspace optimizer be cheap relative to n (e.g., computing the gradient of f with respect to x_C or taking a step on a grid). Simplification requires iterating through the set of terms and computing bounds, so it is linear in the number of terms, m. The connected components are maintained by a dynamic graph algorithm [Holm et al., 2001], which has an amortized complexity of O(log²(|V|)) per operation, where |V| is the number of vertices in the graph. Finally, let the number of iterations of the loop be a function of the dimension, ξ(d), since more dimensions generally require more restarts.
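To make the simplify-and-decompose step concrete, the following sketch (illustrative Python of ours, not the actual RDIS implementation) drops the variables assigned in x_C from every term and recovers the connected components of the remaining variables with a simple union-find; RDIS itself maintains these components incrementally with the fully-dynamic algorithm of Holm et al. [2001], as described above.

def components_after_assignment(terms, assigned):
    """terms: list of sets of variable ids; assigned: variables fixed to constants.
    Returns the groups of free variables that remain connected through shared terms."""
    parent = {}
    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v
    def union(u, v):
        parent[find(u)] = find(v)
    for term in terms:
        free = [v for v in term if v not in assigned]
        for v in free:
            find(v)                          # register isolated free variables too
        for u, v in zip(free, free[1:]):
            union(u, v)                      # variables sharing a term stay connected
    groups = {}
    for v in parent:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())

# Assigning x1 splits f0(x0, x1) + f1(x1, x2) into independent pieces over {x0} and {x2}.
print(components_after_assignment([{0, 1}, {1, 2}], assigned={1}))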
Proposition 1. If, at each level, RDIS chooses x_C ⊆ x of size |x_C| = d such that, for each selected value ρ_C, the simplified function f̂|ρ_C(x_U) locally decomposes into k > 1 independent sub-functions {f̂_i(x_{U_i})} with equal-sized domains x_{U_i}, then the time complexity of RDIS is O((n/d) ξ(d)^{log_k(n/d)}).
Proof. Assuming that m is of the same order as n, the recurrence relation for RDIS is

T(n) = O(n) + ξ(d) [ g(d) + O(m) + O(n) + O(d log²(n)) + k T((n − d)/k) ],

which can be simplified to T(n) = ξ(d) [ k T(n/k) + O(n) ] + O(n). Noting that the recursion halts at T(d), the solution to the above recurrence relation is then

T(n) = c_1 (k ξ(d))^{log_k(n/d)} + c_2 n Σ_{r=0}^{log_k(n/d)−1} ξ(d)^r,

which is O((k ξ(d))^{log_k(n/d)}) = O((n/d) ξ(d)^{log_k(n/d)}).
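The closed-form solution can be checked numerically. The sketch below (illustrative Python; the names and constants are ours) unrolls the simplified recurrence T(n) = ξ(d)(k T(n/k) + c n) with T(d) = c_1 and compares it against the closed form, in which c_2 absorbs the per-level O(n) constant and one factor of ξ(d).

import math

def T(n, d, k, xi, c1=1.0, c=1.0):
    """Unrolls the simplified recurrence T(n) = xi*(k*T(n/k) + c*n), with T(d) = c1."""
    if n <= d:
        return c1
    return xi * (k * T(n / k, d, k, xi, c1, c) + c * n)

def closed_form(n, d, k, xi, c1=1.0, c=1.0):
    """Closed form c1*(k*xi)^L + c2*n*sum_{r<L} xi^r with L = log_k(n/d) and c2 = c*xi."""
    L = round(math.log(n / d, k))
    return c1 * (k * xi) ** L + c * xi * n * sum(xi ** r for r in range(L))

# With n = d * k^L the two values agree (up to floating-point rounding).
print(T(1024, 4, 2, 3.0), closed_form(1024, 4, 2, 3.0))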
A.2 Convergence
In the following, we refer to the basin of attraction of the global minimum as the global basin. Formally, we define a basin of attraction as follows.

Definition 1. The basin of attraction of a stationary point c is the set of points B ⊆ R^n for which the sequence generated by DR, initialized at x_0 ∈ B, converges to c.

Intuitively, at each level of recursion, RDIS with ε = 0 partitions x into {x_C, x_U}, sets values ρ_C for x_C using the subspace optimizer, globally optimizes f|ρ_C(x_U) by recursively calling RDIS, and repeats. When the non-restart steps of the subspace optimizer satisfy two practical conditions (below) of sufficient decrease in (1) the objective function (a standard Armijo condition) and (2) the gradient norm over two successive partial updates (i.e., conditions (3.1) and (3.3) of Bonettini [2011]), this process is equivalent to the 2-block inexact Gauss-Seidel method (2B-IGS) described in Bonettini [2011] (cf. Grippo and Sciandrone [1999]; Cassioli et al. [2013]), and each limit point of the sequence generated by RDIS is a stationary point of f(x), of which the global minimum is one, reachable through restarts.

Formally, let superscript r indicate the recursion level, with 0 ≤ r ≤ d and r = 0 the top, and recall that x_U^(r) = (x_C^(r+1), x_U^(r+1)) if there is no decomposition. The following proofs focus on the no-decomposition case for clarity; however, the extension to the decomposable case is trivial since each sub-function of the decomposition is independent. We denote applying the subspace optimizer to f(x) until the stopping criterion is reached as S_*(f, x) and a single call to the subspace optimizer as S_1(f, x), and note that S_*(f, x), by definition, returns the global minimum x^* and that repeatedly calling S_1(f, x) is equivalent to calling S_*(f, x).

For convenience, we restate conditions (3.1) and (3.3) from Bonettini [2011] (without constraints) on the sequence {x^(k)} generated by an iterative algorithm on blocks x_i, for i = 1, …, m, respectively as

f(x_1^(k+1), …, x_i^(k+1), …, x_m^(k)) ≤ f(x_1^(k+1), …, x_i^(k) + λ_i^(k) d_i^(k), …, x_m^(k)),   (C1)

where λ_i^(k) is computed using Armijo line search and d_i^(k) is a feasible descent direction, and

||∇_i f(x_1^(k+1), …, x_i^(k+1), …, x_m^(k))|| ≤ η ||∇_i f(x_1^(k+1), …, x_{i−1}^(k+1), …, x_m^(k))||,   i = 1, …, m,
||∇_i f(x_1^(k+1), …, x_i^(k+1), …, x_m^(k+1))|| ≤ η ||∇_{i−1} f(x_1^(k+1), …, x_{i−1}^(k+1), …, x_m^(k))||,   i = 2, …, m,
||∇_1 f(x_1^(k+2), …, x_i^(k+1), …, x_m^(k+1))|| ≤ η^(1−m) ||∇_m f(x_1^(k+1), …, x_m^(k+1))||,   (C2)

where η ∈ [0, 1) is a forcing parameter. See Bonettini [2011] for further details. The inexact Gauss-Seidel method is defined as every method that generates a sequence such that these conditions hold, and it is guaranteed to converge to a critical point of f(x) when m = 2. Let RDIS_DR refer to RDIS(f, x, x_0, S = DR, ε = 0).
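To make the 2-block structure concrete, the following sketch (purely illustrative Python of ours) alternates between optimizing over x_C with x_U fixed and over x_U with the resulting values fixed, using plain gradient steps as a stand-in for the subspace optimizer; it does not enforce (C1) or (C2) and is not the RDIS implementation.

import numpy as np

def two_block_alternation(grad_f, xC0, xU0, inner_steps=200, outer_iters=25, lr=1e-2):
    """Illustrative 2-block alternation: take gradient steps on f over x_C with
    x_U fixed, then over x_U with the new x_C fixed, and repeat."""
    xC, xU = np.array(xC0, dtype=float), np.array(xU0, dtype=float)
    for _ in range(outer_iters):
        for _ in range(inner_steps):          # rho_C <- optimize f(., x_U) over x_C
            xC -= lr * grad_f(xC, xU)[0]
        for _ in range(inner_steps):          # rho_U <- optimize f(rho_C, .) over x_U
            xU -= lr * grad_f(xC, xU)[1]
    return xC, xU

# Toy nonconvex objective with two 3-dimensional blocks.
f = lambda c, u: np.sum(np.sin(c) * np.sin(u)) + 0.1 * (np.sum(c**2) + np.sum(u**2))
grad_f = lambda c, u: (np.cos(c) * np.sin(u) + 0.2 * c, np.sin(c) * np.cos(u) + 0.2 * u)
xC, xU = two_block_alternation(grad_f, np.ones(3), -np.ones(3))
print(f(xC, xU))   # value at the limit point reached from this initialization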
Proposition 2. If the non-restart steps of RDIS satisfy (C1) and (C2), ε = 0, the number of variables is n, the volume of the global basin is v = l^n, and the volume of the entire space is V = L^n, then RDIS_DR returns the global minimum after t restarts, with probability 1 − (1 − (v/V))^t.
Proof. Step 1. Given a finite number of restarts, one of which starts in the global basin, RDIS_DR with no recursion returns the global minimum and satisfies (C1) and (C2). This can be seen as follows. At r = 0, RDIS_DR chooses x_C^(0) = x^(0) and x_U^(0) = ∅ and repeatedly calls S_1(f^(0), x_C^(0)). This is equivalent to calling S_*(f^(0), x_C^(0)) = S_*(f, x), which returns the global minimum x^*. Thus, RDIS_DR returns the global minimum. Returning the global minimum corresponds to a step in the exact Gauss-Seidel algorithm, which is a special case of the IGS algorithm and, by definition, satisfies (C1) and (C2).

Step 2. Now, if the non-restart steps of S_1(f, x) satisfy (C1) and (C2), then RDIS_DR returns the global minimum. We show this by induction on the levels of recursion.

Base case. From Step 1, we have that RDIS_DR(f^(d), x^(d)) returns the global minimum and satisfies (C1) and (C2), since RDIS_DR does not recurse beyond this level.

Induction step. Assume that RDIS_DR(f^(r+1), x^(r+1)) returns the global minimum. We now show that RDIS_DR(f^(r), x^(r)) returns the global minimum. RDIS_DR(f^(r), x^(r)) first partitions x^(r) into the two blocks x_C^(r) and x_U^(r) and then iteratively takes the following two steps: ρ_C^(r) ← S_1(f|σ_U^*(x_C^(r))) and ρ_U^(r) ← RDIS_DR(f|ρ_C^(r)(x_U^(r))). The first simply calls the subspace optimizer on ρ_C^(r). The second is a recursive call equivalent to RDIS_DR(f^(r+1), x^(r+1)), which, from our inductive assumption, returns the global minimum ρ_U^(r) = x_U^(r)* of f|ρ_C^(r)(x_U^(r)) and satisfies (C1) and (C2). For S_1(f|σ_U^*(x_C^(r))), RDIS_DR will never restart the subspace optimizer unless the sequence it is generating converges. Thus, for each restart, since there are only two blocks and both the non-restart steps of S_1(f|σ_U^*(x_C^(r))) and the RDIS_DR(f|ρ_C^(r)(x_U^(r))) steps satisfy (C1) and (C2), RDIS_DR is a 2B-IGS method and the generated sequence converges to the stationary point of the current basin. At each level, after converging, RDIS_DR will restart, iterate until convergence, and repeat for a finite number of restarts, one of which will start in the global basin and thus converge to the global minimum, which is then returned.

Step 3. Finally, since the probability of RDIS_DR starting in the global basin is (v/V), the probability of it not starting in the global basin after t restarts is (1 − (v/V))^t. From above, we have that RDIS_DR will return the global minimum if it starts in the global basin, thus RDIS_DR will return the global minimum after t restarts with probability 1 − (1 − (v/V))^t.
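Inverting the bound in Proposition 2 gives the number of uniform restarts needed to reach a target success probability; a small illustrative computation (assuming the basin volume ratio v/V is known):

import math

def restarts_needed(v_over_V, p):
    """Smallest t with 1 - (1 - v/V)^t >= p: restarts needed to hit the global
    basin with probability at least p when each restart lands in it w.p. v/V."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - v_over_V))

# If the global basin fills 1% of the space, about 299 restarts give 95% confidence.
print(restarts_needed(0.01, 0.95))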
B RDIS Subroutine Details

B.1 Variable Selection
In hypergraph partitioning, the goal is to split the graph into k components of approximately equal size while minimizing the number of hyperedges cut. Similarly, in order to maximize decomposition, RDIS should choose the smallest block of variables that, when assigned, decomposes the remaining variables. Accordingly, RDIS constructs a hypergraph H = (V, E) with a vertex for each term, {n_i ∈ V : f_i ∈ f}, and a hyperedge for each variable, {e_j ∈ E : x_j ∈ x}, where each hyperedge e_j connects to all vertices n_i for which the corresponding term f_i contains the variable x_j. When H is partitioned, the resulting cutset is the smallest set of variables that need to be removed in order to decompose the hypergraph. Since assigning a variable to a constant effectively removes it from the optimization (and the hypergraph), the cutset is exactly the set that RDIS chooses on line 2.
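As an illustration of this construction, the sketch below (our own illustrative Python, not the RDIS code) represents the hypergraph as a map from each hyperedge (variable) to the vertices (terms) it touches and extracts the cutset induced by a given partition of the terms; in RDIS the partition itself is produced by a hypergraph partitioner such as PaToH.

from collections import defaultdict

def build_hypergraph(terms):
    """terms: list of sets of variable ids; term i is vertex n_i and variable j is
    hyperedge e_j, incident to every term that mentions x_j."""
    edges = defaultdict(set)
    for i, term_vars in enumerate(terms):
        for j in term_vars:
            edges[j].add(i)
    return edges

def cutset(edges, partition):
    """Variables whose hyperedge spans more than one component of `partition`,
    a map from term index to component id (e.g. as produced by a partitioner)."""
    return {j for j, verts in edges.items()
            if len({partition[i] for i in verts}) > 1}

# Toy example: f = f0(x0, x1) + f1(x1, x2) + f2(x3, x4) + f3(x4, x5).
edges = build_hypergraph([{0, 1}, {1, 2}, {3, 4}, {4, 5}])
print(cutset(edges, {0: 0, 1: 0, 2: 1, 3: 1}))   # set(): the two halves share no variable
print(cutset(edges, {0: 0, 1: 1, 2: 1, 3: 1}))   # {1}: x1 must be assigned to separate f0 from f1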
B.2 Execution time
Variable selection typically occupies only a tiny fraction of the runtime of RDIS, with the vast majority of RDIS’ execution time spent computing gradients for the subspace optimizer. A small but non-negligible amount of time is spent maintaining the component graph, but this is much more efficient than if we were to recompute the connected components each time, and the exponential gains from decomposition are well worth the small upkeep costs.

C Experimental Details
All experiments were run on the same compute cluster. Each computer in the cluster was identical, with two 2.33GHz quad-core Intel Xeon E5345 processors and 16GB of RAM. Each algorithm was limited to a single thread.
C.1 Structure from Motion
In the structure from motion task (bundle adjustment [Triggs et al., 2000]), the goal is to minimize the error between a dataset of points in a 2-D image and a projection of fitted 3-D points representing a scene’s geometry onto fitted camera models. The variables are the parameters of the cameras and the positions of the points and the cameras. This problem is highly structured in a global sense: cameras only interact explicitly with points, creating a bipartite graph structure that RDIS is able to exploit. The dataset used is the 49-camera, 7776-point data file from the Ladybug dataset [Agarwal et al., 2010], where the number of points is scaled proportionally to the number of cameras used (i.e., if half the cameras were used, half of the points were included). There are 9 variables per camera and 3 variables per point.
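For concreteness, the sketch below shows the form of a single reprojection-error term under the camera model used by the Ladybug data of Agarwal et al. [2010] (angle-axis rotation, translation, focal length, and two radial-distortion coefficients, i.e., 9 variables per camera, plus 3 per point); it is an illustrative stand-in, not the exact objective code used in our experiments.

import numpy as np

def rotate(p, aa):
    """Rotate point p by the angle-axis vector aa (Rodrigues' formula)."""
    theta = np.linalg.norm(aa)
    if theta < 1e-12:
        return p
    k = aa / theta
    return p * np.cos(theta) + np.cross(k, p) * np.sin(theta) + k * np.dot(k, p) * (1.0 - np.cos(theta))

def reprojection_residual(camera, point, observation):
    """camera = [angle-axis (3), translation (3), focal, k1, k2]; point is a 3-D
    scene point; observation is the measured 2-D image location."""
    cam = np.asarray(camera, dtype=float)
    p = rotate(np.asarray(point, dtype=float), cam[:3]) + cam[3:6]
    f, k1, k2 = cam[6:9]
    xp, yp = -p[0] / p[2], -p[1] / p[2]            # perspective division
    r2 = xp * xp + yp * yp
    distortion = 1.0 + k1 * r2 + k2 * r2 * r2      # radial distortion
    return f * distortion * np.array([xp, yp]) - np.asarray(observation, dtype=float)

# The full objective sums the squared norm of this residual over all observed
# (camera, point) pairs, so each term couples one camera block with one point block.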
Figure 1: A 2-D example of the highly multimodal test function.

Figure 2: Trajectories on the test function for the data in Figure 3 in the main paper. Sharp rises show restarts. Notably, RDIS-NRR restarts much more often than the other algorithms because decomposition allows it to move through the space much more efficiently. Without internal restarting it gets stuck at the same local minima as BCD-CGD and CGD. For arity 12, RDIS never performs a full restart and still finds the best minimum, despite using the same initial point as the other algorithms.
C.2 Highly Multimodal Test Function
The test function is defined as follows. Given a height h, a branching factor k, and a maximum arity a, we define a complete k-ary tree of variables of the specified height, with x_0 as the root. For all paths p_j ∈ P in the tree of length l_j ≤ a, with l_j even, we define a term t_{p_j} = Π_{x_i ∈ p_j} sin(x_i). The test function is

f_{h,k,a}(x_0, …, x_n) = Σ_{i=1}^{n} (c_0 x_i + c_1 x_i²) + c_2 Σ_{p_j ∈ P} t_{p_j}.

The resulting function is a multidimensional sinusoid placed in the basin of a quadratic function parameterized by c_1, with a linear slope defined by c_0. The constant c_2 controls the amplitude of the sinusoids. For our tests, we used c_0 = 0.6, c_1 = 0.1, and c_2 = 12. A 2-D example of this function is shown in Figure 1. We used a tree height of h = 11, with branching factor k = 2, resulting in a function of 4095 variables. We evaluated each of the algorithms on functions with terms of arity a ∈ {4, 8, 12}, where a larger arity defines more complex dependencies between variables as well as more terms in the function. The functions for the three different arity levels had 16372, 24404, and 30036 terms, respectively.
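This construction can be reproduced directly. The sketch below (illustrative Python; the helper names are ours) enumerates the downward tree paths containing an even number l_j ≤ a of variables and counts one sinusoid term per path plus the linear and quadratic term of each variable, which reproduces the term counts reported above (e.g., 16372 for a = 4).

import numpy as np

def build_terms(h, k, a):
    """Enumerate the sinusoid terms of f_{h,k,a}: every downward path in the
    complete k-ary tree of height h with an even number (>= 2, <= a) of
    variables contributes one term prod(sin(x_i)) over the path."""
    n_vars = (k ** (h + 1) - 1) // (k - 1)                    # 4095 for h = 11, k = 2
    children = lambda i: [c for c in range(k * i + 1, k * i + k + 1) if c < n_vars]
    paths = []
    def extend(path):
        if len(path) >= 2 and len(path) % 2 == 0:
            paths.append(tuple(path))
        if len(path) < a:
            for c in children(path[-1]):
                extend(path + [c])
    for root in range(n_vars):
        extend([root])
    return n_vars, paths

def f(x, paths, c0=0.6, c1=0.1, c2=12.0):
    quad = np.sum(c0 * x + c1 * x ** 2)                       # linear slope + quadratic basin
    sins = sum(np.prod(np.sin(x[list(p)])) for p in paths)    # one sinusoid term per path
    return quad + c2 * sins

n_vars, paths = build_terms(h=11, k=2, a=4)
print(n_vars, len(paths) + 2 * n_vars)   # 4095 variables; 8182 + 8190 = 16372 terms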
Figure 2 shows the value of the current state for each algorithm over its entire execution. These are the trajectories for Figure 3 of the main paper.
C.3 Protein Folding

Problem details
Protein folding [Anfinsen, 1973; Baker, 2000] is the process by which a protein, consisting of a long chain of amino acids, assumes its functional shape. The computational problem is to predict this final conformation given a known sequence of amino acids. This requires minimizing an energy function consisting mainly of a sum of pairwise distance-based terms representing chemical bonds, hydrophobic interactions, electrostatic forces, etc., where, in the simplest case, the variables are the relative angles between the atoms. The optimal state is typically quite compact, with the amino acids and their atoms bonded tightly to one another and the volume of the protein minimized. Each amino acid is composed of a backbone segment and a sidechain, where the backbone segment of each amino acid connects to its neighbors in the chain, and the sidechains branch off the backbone segment and form bonds with distant neighbors. The sidechain placement task is to predict the conformation of the sidechains when the backbone atoms are fixed in place.

Energies between amino acids are defined by the Lennard-Jones potential function, as specified in the Rosetta protein folding library [Leaver-Fay et al., 2011]. The basic form of this function is E_LJ(r) = A/r^12 − B/r^6, where r is the distance between two atoms and A and B are constants that vary for different types of atoms. The Lennard-Jones potential in Rosetta is modified slightly so that it behaves better when r is very large or very small. The full energy function is

E(φ) = Σ_{j,k} E_jk(R_j(χ_j), R_k(χ_k)),

where R_j is an amino acid (also called a residue) in the protein, φ is the set of all torsion angles, and φ_i ∈ χ_j are the angles for R_j. Each residue has between zero and four torsion angles that define the conformation of its sidechain, depending on the type of amino acid. The terms E_jk compute the energy between pairs of residues as

E_jk = Σ_{a_j} Σ_{a_k} E_LJ(r(a_j(χ_j), a_k(χ_k))),

where a_j and a_k refer to the positions of the atoms in residues j and k, respectively, and r(a_j, a_k) is the distance between the two atoms. The torsion angles define the positions of the atoms through a series of kinematic relations, which we do not detail here.
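A minimal sketch of these energy terms (illustrative Python with placeholder constants; Rosetta's actual implementation smooths the potential and uses atom-type-specific parameters A and B):

import numpy as np

def lennard_jones(r, A, B):
    """Basic Lennard-Jones pair energy E_LJ(r) = A / r^12 - B / r^6."""
    return A / r ** 12 - B / r ** 6

def residue_pair_energy(atoms_j, atoms_k, A, B):
    """E_jk: sum of E_LJ over all atom pairs from residues j and k.  atoms_j and
    atoms_k are (num_atoms, 3) position arrays, which in the full model are
    functions of the torsion angles chi_j and chi_k."""
    diffs = atoms_j[:, None, :] - atoms_k[None, :, :]
    r = np.linalg.norm(diffs, axis=-1)                 # all pairwise atom distances
    return np.sum(lennard_jones(r, A, B))

# Toy usage: two 3-atom "residues" a few angstroms apart, with placeholder A and B.
rng = np.random.default_rng(0)
atoms_j = rng.normal(size=(3, 3))
atoms_k = atoms_j + np.array([4.0, 0.0, 0.0])
print(residue_pair_energy(atoms_j, atoms_k, A=1.0, B=1.0))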
The smallest (with respect to the number of terms) protein (ID 1) has 131 residues, 2282 terms, and 257 variables, while the largest (ID 21) has 440 residues, 9380 terms, and 943 variables. The average number of residues, terms, and variables is 334, 7110, and 682, respectively. The proteins with their IDs from the paper are as follows: (1) 4JPB, (2) 4IYR, (3) 4M66, (4) 3WI4, (5) 4LN9, (6) 4INO, (7) 4J6U, (8) 4OAF, (9) 3EEQ, (10) 4MYL, (11) 4IMH, (12) 4K7K, (13) 3ZPJ, (14) 4LLI, (15) 4N08, (16) 2RSV, (17) 4J7A, (18) 4C2E, (19) 4M64, (20) 4N4A, (21) 4KMA.

References
[Agarwal et al., 2010] Sameer Agarwal, Noah Snavely, Steven M.
Seitz, and Richard Szeliski. Bundle adjustment in the large. In
Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors,
Computer Vision – ECCV 2010, volume 6312 of Lecture Notes
in Computer Science, pages 29–42. Springer Berlin Heidelberg,
2010.
[Anfinsen, 1973] Christian B. Anfinsen. Principles that govern the
folding of protein chains. Science, 181(4096):223–230, 1973.
[Baker, 2000] David Baker. A surprising simplicity to protein fold-
ing. Nature, 405:39–42, 2000.
[Bonettini, 2011] Silvia Bonettini. Inexact block coordinate de-
scent methods with application to non-negative matrix factor-
ization. IMA Journal of Numerical Analysis, 31(4):1431–1452,
2011.
[Cassioli et al., 2013] A Cassioli, D Di Lorenzo, and M Scian-
drone. On the convergence of inexact block coordinate descent
methods for constrained optimization. European Journal of Op-
erational Research, 231(2):274–281, 2013.
[Grippo and Sciandrone, 1999] Luigi Grippo and Marco Scian-
drone. Globally convergent block-coordinate techniques for un-
constrained optimization. Optimization Methods and Software,
10(4):587–637, 1999.
[Holm et al., 2001] Jacob Holm, Kristian De Lichtenberg, and
Mikkel Thorup. Poly-logarithmic deterministic fully-dynamic al-
gorithms for connectivity, minimum spanning tree, 2-edge, and
biconnectivity. Journal of the ACM (JACM), 48(4):723–760,
2001.
[Leaver-Fay et al., 2011] Andrew Leaver-Fay, Michael Tyka,
Steven M Lewis, Oliver F Lange, James Thompson, Ron Jacak,
Kristian Kaufman, P Douglas Renfrew, Colin A Smith, Will
Sheffler, et al. ROSETTA3: an object-oriented software suite
for the simulation and design of macromolecules. Methods in
Enzymology, 487:545–574, 2011.
[Trifunović, 2006] Aleksandar Trifunović. Parallel algorithms for
hypergraph partitioning. PhD thesis, University of London, February
2006.
[Triggs et al., 2000] Bill Triggs, Philip F. McLauchlan, Richard I.
Hartley, and Andrew W. Fitzgibbon. Bundle adjustment – a mod-
ern synthesis. In Vision Algorithms: Theory and Practice, pages
298–372. Springer, 2000.
