DISSERTATION
DOCTOR OF PHILOSOPHY
at the
POLYTECHNIC UNIVERSITY
by
Gavriel Yarmish
March 2001
Copyright 2001 by Gavriel Yarmish

Thesis committee:
Professor of Computer & Information Science
Alex Delis, Associate Professor of Computer & Information Science
Torsten Suel, Assistant Professor of Computer & Information Science
Industry Associate Professor of Management
Gavriel Yarmish was born in the United States. He was awarded his BS in
Computer Science, Magna Cum Laude, from Touro College in 1991 and his MA in
Computer Science from Brooklyn College in 1993. He is a member of Tau Beta Pi,
the engineering honor society, and has taught Computer Science and Mathematics
since 1995. The research presented in this thesis was completed between 1994 and
2001.
encouragement and support during the time spent on this research and before.
Acknowledgments
committee, Dr. Richard Van Slyke, for working with me throughout this long
University. Jeff Damens, our system administrator, helped install and troubleshoot
MINOS and MPI. He has also responded quickly to all network-related problems.
Torsten Suel provided input regarding various communication models. I also wish to
thank Joel Wein for the use of his lab during the early stages of my research and
Boris Aronov for his help and concern throughout the progress of this study. R. N.
Uma helped to explain the setup of MPI in the computing labs. I wish also to
acknowledge my good friend Jacob Maltz for his help with UNIX shell scripting and
Abstract
by
Gavriel Yarmish
DOCTOR OF PHILOSOPHY
March 2001
The Simplex Method, the most popular method for solving Linear Programs
(LPs), has two major variants: the revised method and the standard, or full
tableau, method. Today, virtually all serious implementations are of the revised
method because it is more efficient for sparse LPs, which are the most common.
However, the full tableau method has advantages as well. First, the full
tableau can be very effective for dense problems. Second, a full tableau method can
Although dense problems are uncommon in general, they occur frequently in some important
applications such as digital filter design, text categorization, image processing and
is effective for small to moderately sized dense problems. The second, a simple
extension of the first, is a distributed algorithm, which is effective for large problems
of all densities.
We developed performance models that predict running times per iteration for
the serial version of our method, the parallel version of our method and the revised
method for problems of different sizes, aspect ratios and densities. We also developed
methods for choosing the number of processors to optimize the tradeoff between
Table of Contents
1. Introduction
2. Related Work
2.1 Interior point methods
2.2 Parallel implementations for the simplex method
3. Review of the Simplex Method
3.1 A short review of linear algebra
3.2 Definition of a Linear Program
3.3 Short description of the full tableau simplex method
3.4 Short description of the revised method
3.5 Running time comparison of the revised method and the full tableau method
4. Motivation for a serial and distributed full tableau implementation
4.1 Why the method is applied to full tableau
4.1.1 No pricing out or column reconstruction
4.1.2 Alternate column choice rules can easily be used
4.1.3 The inverse does not have to be kept in each processor
4.1.4 Dense problems do not gain by use of the revised
4.2 Distributed computing
4.2.1 Why use distributed over MPP: no need for a supercomputer
4.2.2 Why coarse grain parallelism is used
5. Models and Analysis
5.1 Synchronized parallel pivots
5.2 Short sketch of one distributed pivot step
5.2.1 Choice of four models of parallel communication
5.2.2 Basic explanation of communication parameters
5.2.3 Analyzing basic communication operations
5.3 The Experimental Environment
5.3.1 Description of lab(s)
5.4 Estimates of computation parameters ucol, urow and upiv
5.5 Estimates of communication parameters s, g and L
5.6 The performance model
5.7 The optimum number of processors
5.8 The optimum number of processors with a new column division scheme
5.9 Estimated parallel running time for each communication model
5.10 Running time estimates of the revised (MINOS), serial (retroLP) and parallel full-tableau algorithms (dpLP)
5.11 Advantage of the Steepest Edge column choice rule
5.12 Memory requirements of revised and full tableau method
5.13 Asymptotic (computation/communication ratio change) analysis
5.14 Sensitivity to s, π and p
6. Implementation Choices
6.1 Distributed programming software
6.2 Sockets
6.3 Reasons for choice of both sockets and MPI
6.4 The specific commands used
Figures
Figure 3.1 - Full tableau vs. the revised methods
Figure 3.2 - n at which the revised method overtakes the tableau method (m=1,000)
Figure 5.1 - Time per iteration as n increases
Figure 5.2 - Time per iteration (m=1,000)
Figure 5.3 - Time per iteration as a function of density
Figure 5.4 - Time per iteration as a function of aspect ratio
Figure 5.5 - Time per iteration as a function of m
Figure 5.6 - Time per iteration as a function of p
Figure 5.7 - Time per iteration as a function of p
Figure 5.8 - Memory (in megabytes)
Figure 5.9 - Asymptotic speedup when a unit computation = 10^-7; s and g move together
Figure 5.10 - Asymptotic speedup time when s = 3.4*10^-7 and g = 1.7*10^-6; computation is changing
Figure 5.11 - Time per iteration as a function of relative error in s
Figure 5.12 - p* as a function of relative error in s
Figure 5.13 - Time per iteration as a function of relative error in π
Figure 5.14 - p* as a function of relative error in π
Figure 7.1 - Total time by generator vs. density
Figure 7.2 - Time per iteration by generator vs. density
Figure 7.3 - Iteration time vs. mn (classical column choice rule)
Figure 7.4 - Iteration time vs. mn+αmn (steepest edge column choice rule)
Figure 7.5 - Actual timing as a function of density
Figure 7.6 - Actual timing as a function of p
Figure 7.7 - retroLP vs. MINOS
Figure 7.8 - Speedup relative to MINOS (m=500, n=1,000)
Figure 7.9 - Speedup relative to MINOS (m=1,000, n=5,000)
Figure A.1
Tables
Table 5.1 - Coefficient estimates
Table 5.2 - Coefficient estimates used in models
Table 5.3 - Expressions of the six models
Table 5.4 - p* and optimal time per iteration
Table 5.5 - Time per iteration for m=1,000
Table 5.6 - Estimated running time per iteration
Table 5.7 - p* and T as a function of relative error in s
Table 5.8 - p* and T as a function of relative error in s
Table 7.1 - Netlib problems sorted by density
Table 7.2 - Percentage errors in both groups of problems
Table 7.3 - The 24 problems used
Table 7.4 - Actual timing as a function of density
Table 7.5 - Actual timing as a function of p
Table A.1 - Data structure for retroLP and dpLP
1. Introduction
The simplex algorithm of linear programming has been cited as one of "10
algorithms with the greatest influence on the development and practice of science and
engineering in the 20th century," [Dongarra and Sullivan, 2000]; however, there are
popular form of the simplex method, the "revised" form, to the earlier "standard"
form (full tableau) we have been able to implement an effective coarse grained
distributed algorithm which is a simple extension of the standard form of the simplex
method. We thus reexamine the original version, the full tableau method of the
simplex algorithm. Today, virtually all serious implementations are of the revised
method because it is much faster for sparse LPs, which are most common. However,
the full tableau method has advantages as well. First, the full tableau can be very
effective for dense problems. Second, as we have already mentioned, a full tableau
of the linear program among processors. This has several implications. All activities
performed on columns are essentially reduced linearly. This in turn suggests the use
of the full tableau simplex method in place of the more standard revised simplex
method. In the tableau method no processor needs to keep a copy of the basis or its
inverse. Moreover, the work of updating the columns can be done in parallel. We
wrote a serial linear programming code based on the full tableau. Then, with the
constructed.
The revised method is faster for sparse problems. For dense problems, though,
the full tableau method may perform better. This is because the revised method must
calculate data that the full tableau would already have. For a comparison of the
Our method is good for dense problems even when using the serial program.
One issue that must be addressed is how to give each processor enough
amortize. One way is to pick better columns by using alternate column choice rules.
These alternate rules cost more in computation than the standard rule. On the other
hand, these rules may reduce the number of iterations, see [Wolfe and Cutler, 1963],
[Kuhn and Quandt, 1963], [Goldfarb and Reid, 1977], [Forrest and Goldfarb, 1992]
and [Fletcher, 1998]. By pricing out columns in parallel, the extra cost can be
of size, aspect ratios and density. We also see which column choice rule would be
related work both in parallel implementation of the simplex method and in other
methods of solving linear programs. Section 3 gives a review of the simplex method.
Section 4 adds to the motivation of our work. It explains in more detail the types of
problems for which this method is applicable and how, with this method, different
column choice rules can be employed. Section 5 explains in detail the communication
models and their analysis. It shows how well our method does in comparison to other
explains how we implemented our serial and parallel method. Section 7 details
offers conclusions and possibilities for future work. Two appendices follow; one
specifies the MPS input format our linear program packages accept and the other
2. Related Work
The simplex algorithm is not the only way to solve a linear program. There
are other methods. The main competitors are a group of methods known as interior
point methods. Some interior point methods have polynomial worst case running
times, which are less than the exponential worst case running time of the simplex
method [Nash and Sofer, 1996 pgs. 269-278]. On average though, the simplex
method is competitive with these methods. With the simplex method, post-run
analysis is also possible [Nash and Sofer, 1996]. This dissertation focuses on the
simplex method.
Parallel implementations for general linear programs have been based on the
revised method [Hall and McKinnon, 1997] and the full tableau method [Eckstein et
al, 1995]. The parallel implementation of the full tableau used "stripe arrays" on the
dependent. The approach we use is coarse grained and can be applied to distributed
1988]. Stunkel and Reed also used the full tableau form of the simplex method. They
actually compared two ways of partitioning the constraint matrix amongst the
processors on a hypercube. The first way is similar to our partitioning scheme where
groups of columns are partitioned and given to different processors. The second way
A few of the differences between our paper and the hypercube implementation are:
communication.
edge rule.
d. We compared our method to the revised method for dense and sparse
problems.
machines based on the dual simplex method [Bixby and Martin, 2000].
3. Review of the Simplex Method

3.1 A short review of linear algebra
Ax = b
or
Σ_{j=1}^{m+n} a_{ij} x_j = b_i    (i = 1, ..., m)
We assume, for now, that A is m x (n+m) and of full rank. Thus there is a set
We also write x = [x_N, x_B]^T.
Since B is non-singular:
N x_N + I x_B = b, or

x_{n+i} = b_i − Σ_{j=1}^{n} a_{ij} x_j    (i = 1, 2, ..., m)    (3.1)
The variables x_B = [x_{n+1}, x_{n+2}, ..., x_{n+m}] are basic (dependent), and the
variables in a solution.
satisfying their constraints, so that the resulting values for the basic variables using
(3.1) happen to satisfy their constraints, we say that the dictionary and the non-basic
into dictionary form. The following section continues the discussion with the addition
of an objective function.
books detailing the full tableau method. One such book is [Chvátal, 1983].
We start with:
Maximize    z = cx
Subject to  Ax = b
            l_j ≤ x_j ≤ u_j,   j = 1, ..., n
x_{n+i} = b_i − Σ_{j=1}^{n} a_{ij} x_j    (i = 1, 2, ..., m)
We can also use this to eliminate all the basic variables in z = cx.
Maximize    z = c_0 + Σ_{j=1}^{n} c_j x_j
Subject to  x_{n+i} = b'_i − Σ_{j=1}^{n} a'_{ij} x_j    (i = 1, 2, ..., m)    (3.2)
            l_j ≤ x_j ≤ u_j,   j = 1, ..., m+n
Since this is the representation we will use from now on, we drop the primes
from the coefficients. The dictionary is said to be feasible for given values of
x_1, ..., x_n if the given values satisfy their bounds and if the resulting values
for x_{n+1}, ..., x_{n+m} satisfy theirs.
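This feasibility test can be sketched in a few lines. The function below is an illustration only (names are mine, not taken from retroLP): it evaluates the basic values via (3.1) and checks both sets of bounds.

```python
# Hypothetical sketch of the dictionary feasibility test described above.
# a: m x n dictionary coefficients; b: length-m right-hand side;
# lo/up: bounds for all n+m variables (non-basic first, then basic).
def dictionary_feasible(a, b, x_nonbasic, lo, up):
    m, n = len(a), len(x_nonbasic)
    # The non-basic values must satisfy their own bounds.
    for j, xj in enumerate(x_nonbasic):
        if not (lo[j] <= xj <= up[j]):
            return False
    # The resulting basic values x_{n+i} = b_i - sum_j a_ij x_j must satisfy theirs.
    for i in range(m):
        x_basic = b[i] - sum(a[i][j] * x_nonbasic[j] for j in range(n))
        if not (lo[n + i] <= x_basic <= up[n + i]):
            return False
    return True
```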
Suppose our dictionary besides being feasible has the following optimality
properties:
(i) for every non-basic variable x_j that is strictly below its upper bound we
have c_j ≤ 0, and
(ii) for every non-basic x_j that is strictly above its lower bound we have c_j ≥ 0.
Then no feasible change of the non-basic variables will increase z and hence the current solution is optimal.
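Conditions (i) and (ii) translate directly into code. The following is a minimal sketch (the function name and tolerance are assumptions of mine, not part of the dissertation's codes):

```python
# Sketch of the optimality test (i)-(ii): a feasible dictionary is optimal
# when no eligible non-basic variable can improve z.
def is_optimal(c, x, lo, up, eps=1e-9):
    """c: reduced costs of the n non-basic variables; x: their current values."""
    for j, cj in enumerate(c):
        if x[j] < up[j] - eps and cj > eps:    # could raise x_j and gain c_j > 0
            return False
        if x[j] > lo[j] + eps and cj < -eps:   # could lower x_j and gain -c_j > 0
            return False
    return True
```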
Details of Phase I are not discussed here. See [Nash, 1996] or [Chvátal, 1983].
Maximize    c_1 x_1 + c_2 x_2 + ... + c_n x_n
Subject to  a_11 x_1 + a_12 x_2 + ... + a_1n x_n  op  b_1
            a_21 x_1 + a_22 x_2 + ... + a_2n x_n  op  b_2
            ...
            a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n  op  b_m

where op refers to any of the relations =, <= or >=, and the variables can be
bounded from above and below.
We first add slack, surplus and artificial variables to the constraints. This gives:

Maximize    c_1 x_1 + c_2 x_2 + ... + c_n x_n
Subject to  a_11 x_1 + a_12 x_2 + ... + a_1n x_n + s_1 = b_1
            a_21 x_1 + a_22 x_2 + ... + a_2n x_n + s_2 = b_2
            ...
            a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n + s_m = b_m

where slacks <= 0, surpluses >= 0 and artificials = 0.
Maximize    z = c_1 x_1 + c_2 x_2 + ... + c_n x_n
Subject to  s_1 = b_1 − a_11 x_1 − a_12 x_2 − ... − a_1n x_n
            s_2 = b_2 − a_21 x_1 − a_22 x_2 − ... − a_2n x_n
            ...
            s_m = b_m − a_m1 x_1 − a_m2 x_2 − ... − a_mn x_n

where slacks <= 0, surpluses >= 0 and artificials = 0.
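The construction above can be sketched as follows. This is a simplified illustration under my own conventions (one added variable per row, with its bounds encoding the original relation); it is not the dissertation's input routine.

```python
# Sketch: add one variable s_i per row so every constraint becomes an equality,
# then rewrite as the dictionary s_i = b_i - sum_j a_ij x_j.
INF = float("inf")

def to_dictionary(a, b, ops):
    """a: m x n coefficients, b: right-hand sides, ops: list of '<=', '>=', '='."""
    bounds = []
    for op in ops:
        if op == "<=":
            bounds.append((0.0, INF))      # added variable takes up the slack
        elif op == ">=":
            bounds.append((-INF, 0.0))     # added variable absorbs the surplus
        else:
            bounds.append((0.0, 0.0))      # '=': artificial variable pinned at 0
    # Each dictionary row stores [b_i, -a_i1, ..., -a_in].
    dict_rows = [[b[i]] + [-aij for aij in a[i]] for i in range(len(a))]
    return dict_rows, bounds
```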
The "a" values can be put into a tableau (matrix). Appendix A gives an
This and the previous section (3.1 and 3.2) explained how to put a linear
program into dictionary form. No pivots were necessary. The following section
problem is solved. This linear program uses the same procedure as explained here. It
uses a different objective on the same dictionary. This first linear program is known
as Phase I. The linear program we need to solve, which uses the real objective, is
Select a non-basic variable xj, with its cost coefficient cj>0, in equation
(3.2) that is not at its upper bound, or one with its cost coefficient cj<0 and not
at its lower bound. There may be many such eligible choices. For now, any
will work. See the discussion below for possible alternatives. If the largest
ii) The basic variable in row i is the first to violate its bound. Pivot using the
violated row.
3) Perform a pivot or move a non-basic variable from one bound to its other
bound.
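Step 3 is an ordinary Gauss-Jordan pivot on the full tableau. The sketch below assumes a common layout (row 0 is the objective row, column 0 the right-hand side); it illustrates the operation rather than reproducing retroLP.

```python
# Minimal full-tableau pivot: scale the pivot row, then eliminate the pivot
# column from every other row (including the objective row).
def pivot(tab, r, s):
    """Pivot the (m+1) x (n+1) tableau in place on row r, column s (r, s >= 1)."""
    piv = tab[r][s]
    tab[r] = [v / piv for v in tab[r]]            # scale pivot row to make a 1
    for i in range(len(tab)):
        if i != r and tab[i][s] != 0:
            f = tab[i][s]
            tab[i] = [a - f * b for a, b in zip(tab[i], tab[r])]
    return tab
```

Each pivot touches all (m+1)(n+1) entries, which is the per-iteration cost the running-time comparison in Section 3.5 builds on.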
The simplex just described uses the classical (Dantzig) column choice rule in
step 1. This is the most commonly used column choice rule. Other column choice
[Wolfe and Cutler, 1963] and [Kuhn and Quandt, 1963] were early studies of
column choice rules. These studies evaluated how different column choice rules
affect running time. The issue was the tradeoff between using relatively complex
(slow) choice rules to reduce the number of iterations and the resulting increased
running time per iteration. In addition to the classic rule introduced by Dantzig the
“greatest change rule" and the "steepest edge rule" were among those tested. The
results were that while the more complex column choice rules would decrease the
number of iterations, the cost of applying those rules was too great for the problems
they studied. The extra computation required took away any speed-up gained by the
reduction in pivot iterations. They performed the tests using the full tableau method.
Harris [1973], Goldfarb and Reid [1977], Forrest and Goldfarb [1992] and
Fletcher [1998] studied how to implement the Steepest Edge rule efficiently in the
revised method. They stored extra information in order to keep a recurrence formula
updated. They report an overall gain in computation speed when using the steepest
edge rule instead of the classic rule. Another column choice rule that is only feasible
to implement in the full tableau method is the “Greatest Change Rule." These three
general classes of column choice rules were implemented in our codes and are
Classical (Dantzig): Take the eligible column with maximum |c'j|; its
column, or order n for the entire process. In the full tableau method the current
Steepest Edge: For each eligible column, divide |c'_j| by the length of the
column of A, sqrt(Σ_i a'_{i,j}²), and take the largest of these quotients. For increased
case, per column evaluated. It is important to note that this calculation is only
necessary for “eligible” non-basic variables, candidates that would increase the
objective value if brought into the basis. This is discussed further in Section
7.2.1.
Greatest Change: For each eligible column, actually compute how much the
objective function would improve if the column were introduced, and select
the one which would cause the greatest change. The complexity here is
The greater amount of work for the steepest edge and greatest change rules is
more than made up for by the reduction in the number of iterations when using the
standard (full tableau) method. However, the greatest change rule is very hard to use in the
revised method. The steepest edge is moderately difficult to implement in the revised
method; special techniques are required [Goldfarb and Reid, 1977], [Forrest and
Goldfarb, 1992].
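The three rules can be summarized as column scores. The sketch below is illustrative only (signature and names are mine): Dantzig and steepest edge score columns from stored data, while greatest change needs the actual step length from the ratio test, supplied here by a caller-provided function.

```python
import math

# Score-based column choice, assuming maximization with eligible columns
# already filtered (c[j] would improve z) and step(j) = ratio-test step length.
def choose_column(rule, c, cols, eligible, step=None):
    def score(j):
        if rule == "dantzig":
            return abs(c[j])                                     # O(1) per column
        if rule == "steepest":
            return abs(c[j]) / math.sqrt(sum(a * a for a in cols[j]))  # O(m)
        if rule == "greatest":
            return c[j] * step(j)                                # needs ratio test
        raise ValueError(rule)
    return max(eligible, key=score)
```

In the full tableau the updated columns cols[j] are already on hand, which is exactly the advantage Section 4.1.2 elaborates.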
The full tableau (matrix) method, unlike the revised simplex method, stores
and pivots on the whole tableau (matrix) of m rows and n columns. The full tableau
method uses information from the top row, which is the objective vector, for the
standard column choice rule. If other column choice rules are used additional
3.4 Short description of the revised method
The Revised Simplex Method is commonly used for solving linear programs.
This method operates on a data structure that is roughly of size m by m instead of the
whole tableau. This is a computational gain over the full tableau method, especially in
sparse systems (where the matrix has many zero entries) and/or in problems with
many more columns than rows. On the other hand, the revised method requires extra
coefficients and the entering pivot column for the column choice and the row choice
respectively. These are computational costs the full tableau method doesn't have.
Maximize    z = cx
Subject to  Ax = b
            l_j ≤ x_j ≤ u_j,   j = 1, ..., m+n

Maximize    z = c_0 + Σ_{j=1}^{n} c'_j x_j
Subject to  x_{n+i} = b'_i − Σ_{j=1}^{n} a'_{ij} x_j    (i = 1, 2, ..., m)
            l_j ≤ x_j ≤ u_j,   j = 1, ..., m+n
the original system together with a functional equivalent of the inverse of the basis B
rather than explicitly in the dictionary form. "Functional equivalent" means we have a
data structure which makes solving πB = c_B for π, and B A'_j = A_j for A'_j, easy. A_j
represents the jth column of the A matrix and c_B represents the basic objective
coefficients. The data structure need not be B^-1 or even necessarily a representation of
it. For example, an LU decomposition of the basis is often used, see [Nash and Sofer,
1996 pgs. 218-222]; another is to represent the inverse as a product of simple pivot
matrices see [Nash and Sofer, 1996] and [Chvátal, 1983 Ch. 7]. Given the implicit
representation, we recreate the data needed to implement the three parts of the
Select Column:
Symbolically, we let the component row vector π represent the multiples; that
is, we multiply constraint i by πi and subtract the result from the expression
for z. To make this work π must have πB = cB where cB, as above, represents
A.
classical Dantzig rule, the vector π must be calculated and then the inner
coefficient.
The revised method takes more effort than the standard simplex
method in this step. However, for sparse matrices, pricing out is sped up
because many of the products have zero factors. Moreover, the revised
[Nash and Sofer, 1996 p. 222]. Partial pricing is a heuristic of not considering
all the columns during the column choice step. On the other hand, alternate
column choice rules such as steepest edge (however see [Forrest and
Goldfarb, 1992]) and greatest change are much more difficult to implement
Select Row:
To implement this we need b' and the column from the dictionary that
we chose in Step 1, A'_s = (a'_{1s}, a'_{2s}, ..., a'_{ms})^T. The b' vector will be updated
was in the original matrix A. In the revised method since we always go back
to the original matrix, we still have the original sparsity. Specifically, we have
Pivot:
3.5 Running time comparison of the revised method and the full tableau method
The revised simplex method is especially efficient for linear programs that are
sparse and have high aspect ratio (n/m). A linear program is sparse if most of the
elements of the dictionary are 0, and it has a high aspect ratio if n/m is large.
Updating any of the representations used by the revised method is usually, at worst,
of order m². On the other hand, pivoting on the explicit representation of the
dictionary takes order mn. Thus for high aspect ratios the standard method takes much
more work. Fortunately, in our distributed method, this work is done in parallel with
performance models for the revised and full tableau methods later in this section.
Figure 3.1 compares advantages and disadvantages of the full tableau vs. the
revised methods.
We will now give expressions to estimate the time per iteration of the revised
and full tableau methods. For simplicity we will count only multiplication and
division operations. The full tableau method using the classical column choice rule
requires m operations for the ratio test and (m+1)(n+1) for the pivot. Column Choice
requires n comparisons (insignificant cost) for the classical column choice rule and
(m+1)n (multiply/divides) for the steepest edge rule. Initially, we consider the
classical column choice rule and will therefore ignore column choice cost. This totals m + (m+1)(n+1) per iteration.
We can safely choose the iteration time estimate based on the explicit inverse
as an upper bound on the true value, since the more exotic functional equivalents of
the basis inverse would not be used if the performance were not better.
the current objective row, m² to compute the entering column, m for the ratio test and
(m+1)² for the pivot. This totals mn + 3m² + 3m + 1. This assumes an explicit inverse;
this can be reduced significantly for sparse problems or if we use a more effective
Assuming a dense matrix, the time per iteration (measured by the number of
multiplications and divisions) of the revised and full tableau methods are equal when
n = 3m² + m. The last line of Table 5.6 in Section 5.7 shows an example. If n < 3m² + m
the full tableau method should be faster. If n > 3m² + m the revised method should be
faster. That means that for dense problems the full tableau method requires less
computation than the revised method for n < 3m² + m. This analysis is for the revised
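The dense-case crossover can be verified with a few lines. The cost expressions come from the counts above; the function names are mine.

```python
# Per-iteration multiply/divide counts for a dense problem, classical rule,
# revised method with an explicit inverse (as derived in the text).
def tableau_ops(m, n):
    return m + (m + 1) * (n + 1)            # ratio test + pivot

def revised_ops(m, n):
    return m * n + 3 * m**2 + 3 * m + 1     # price out, pi, column, ratio, pivot

m = 100
n_star = 3 * m**2 + m                       # crossover point n = 3m^2 + m
assert tableau_ops(m, n_star) == revised_ops(m, n_star)
assert tableau_ops(m, n_star - 1) < revised_ops(m, n_star - 1)   # tableau wins
assert tableau_ops(m, n_star + 1) > revised_ops(m, n_star + 1)   # revised wins
```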
using the explicit inverse representation. For other representations the analysis
As just explained, the revised method pivots on a portion of what the tableau
method pivots on; m columns instead of n columns. On the other hand, it must
calculate the current pivot column and objective row from the original problem. If the
original problem has zeros the revised need not multiply those zero elements. One
cannot count on the pivot itself being shortened since the current pivot column is not
part of the original (sparse) data. Similarly the full tableau method can't take
advantage of this sparsity since it pivots on derived data instead of the original data.
Since the revised method does not do operations on zero elements, there are
potential savings for sparse problems. Let d (for density) be the average number of
nonzero elements. For example if we assume that on average each column has 5% of
its values nonzero, then d = 5%. Determining π, pricing out and calculating the
entering column will take about dm², dmn and dm² respectively. The ratio test and
pivot still require roughly the same number of operations as before: m and (m+1)²
respectively. The total running time of a sparse problem on the revised method using
equals the running time of the full tableau method when n=m. That means that in
complete sparseness the revised is faster as soon as n gets larger than m, which is the
usual case. A similar discussion to this one can be found in Nash and Sofer [Nash and
For example, assume m = 1,000 and there is 5% density (d). The running time of the
revised method is 50n + 100,000 + 1,000,000 + 3,000 + 1 = 50n + 1,103,001. The running time of the
full tableau method is 1,000n + 2,000 + n + 1 = 1,001n + 2,001. The running times of the
revised and full tableau methods are equal when n = m(m(2d + 1) + 1) / (m(1 − d) + 1). When n=1,158
the revised and tableau methods take about the same time. When n<1,158 the tableau
method is faster. Once n>1,158 the revised method is faster. Figure 3.2 is a graph that
shows for m=1,000 at what n the revised method overtakes the tableau method. This
[Figure 3.2 graph: the crossover n rises with density d, from n = 1,000 at d = 0
through 1,158, 1,333, 1,529, 1,749, 1,999, 2,284, 2,613, 2,997 and 3,450 to
3,994 at d = 0.5.]
Figure 3.2-n at which the revised method overtakes the tableau method (m=1,000)
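The crossover formula can be checked numerically; the snippet below (function name mine) reproduces the worked example and two of the plotted values.

```python
# Crossover point between the revised and full tableau methods, from equating
#   full tableau: (m + 1)*(n + 1) + m
#   revised:      d*m*n + 2*d*m*m + (m + 1)**2 + m
# which gives n* = m*(m*(2d + 1) + 1) / (m*(1 - d) + 1).
def crossover_n(m, d):
    return m * (m * (2 * d + 1) + 1) / (m * (1 - d) + 1)

m = 1000
assert round(crossover_n(m, 0.05)) == 1158   # the worked example in the text
assert round(crossover_n(m, 0.0)) == 1000    # complete sparseness: crossover at n = m
assert round(crossover_n(m, 0.5)) == 3994    # matches Figure 3.2
```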
4. Motivation for a serial and distributed full tableau implementation
Aside from the two early papers mentioned, recent research has focused on
the revised method. We use the full tableau method. There are a number of reasons
for this. The two main reasons are that it is better for dense problems (Section 4.1.4)
and that a coarse grained distributed program is straightforward (Section 4.1.3). More
specifically:
4.1.1 No pricing out or column reconstruction

The revised method requires extra computation to price out the objective row. It also requires extra computation to calculate the
column entering into the basis. Using the full tableau method avoids this computation.
4.1.2 Alternate column choice rules can easily be used
Another advantage of using the full tableau method is the possibility of using
multiple column choice rules. These were briefly mentioned in Section 2.1 and in
Section 3.3. The classical (Dantzig) rule simply selects the column with the maximum
coefficient in the objective row of the updated tableau. It is easy to use this rule. The
cost of deciding which column will enter the basis, using the Dantzig rule, is only n
comparisons. Other column choice rules, on the other hand, require the values of the updated columns.
The full tableau method allows the use of these other column choice rules
without the extra computational cost of recreating those columns. Its only additional
cost is m multiplications per column. This computation is amortized over the different
processors in dpLP.
The revised, on the other hand, does not keep the updated column on hand.
(Note the m² cost of computing the entering column.) To use these other rules the
revised must recompute every column (not only the entering column) from the
inverse, instead of just pricing out the objective row. In addition to the mn cost of
pricing out it would now cost m²n to reconstruct all n columns. Goldfarb and Reid
[1977] and Forrest and Goldfarb [1992] addressed this problem for the Steepest-edge rule. They used a
recurrence formula to update the steepest edge direction (norm). There is a substantial
cost to initializing this formula. In addition, each iteration takes longer due to the cost
of updating this recurrence formula. They have reported that using this Steepest Edge
rule generally cuts down the total computation time for any problem size; that is, it
reduces the number of iterations enough to more than compensate for the added
per-iteration cost. A few of the rules that can be used with the full tableau method
include the greatest change method and different gradient methods. Wolfe-Cutler
[1963] and Kuhn-Quandt [1963] have studied these column choice rules using the full
tableau method. They each took rather small linear programming problems and ran
them using various column choice rules. They calculated the time per iteration and
the number of iterations for each run. From these runs they concluded that on average
the greatest change method and different forms of the steepest edge method result in
fewer iterations. The extra cost per iteration, though, was too costly and made the
algorithm as a whole slower when used with these alternate rules. We did similar tests
on our serial implementation on large problems and did find a gain in computation
time for alternate rules. We compare the standard rule to the steepest edge rule in
Sections 5 and 7.
4.1.3 The inverse does not have to be kept in each processor
Our distributed method divides the columns amongst the processors. Using
the revised method requires part of the tableau to be recreated from the inverse each
iteration, which makes a parallel method difficult. At least one processor would need
a copy of the whole m-by-m inverse or its functional equivalent. The full tableau method does not need to recreate any row or column; each processor can simply hold its share of the columns without the extra overhead of the inverse. With the full tableau method, no processor needs to carry the inverse of the basis.
Our method works for all problems, but it performs best when used on dense
problems. This is because the revised method is slower than the full tableau method
for dense problems even on one processor, assuming n is not too much greater than
m, as was noted in Section 3.3. For dense problems there is no point in using the
revised method. Applications that give rise to dense problems are given in Section
8.2.
loosely connected network. They do not share memory but communicate via message
supercomputer. It can also be run on any network of workstations that is not using all
tradeoff between communication time and the time spent by the processors doing
computation time ratio in message passing systems is much higher than that for
processors. There is no overlap of columns. All vectors that hold information for the
non-basic variables are also divided. The x vector, which holds the values of the non-
basic variables, is therefore divided. The b vector, which holds the values of the basic variables, is kept by each processor. Each processor calculates the best candidate for the pivot column from among its columns,
using one of the column choice rules. For each column choice rule that we have
discussed, a numerical value is assigned to each column by the rule, and the column
with the maximum value is chosen by the rule. In parallel, each processor looks at its
columns. The best of these values determines the local column chosen. This column
is that processor’s proposal. A coordinating processor then calculates the best value
from among the proposals. In our (first) implementation, only one column is proposed
per processor although generalizations to multiple proposals are easy to design (see
Section 8.3). The processor with the winning proposal sends its column to all
processors who then pivot on their columns (explained below). This pivot is the last
1. Initialization
processor must be sent the initial value of the basic variables, b, and a
each).
2. Column Proposal
Every iteration, each processor must make known the value of its best
to each processor.
4. Finalization
Since Steps 1 and 4 are executed only once, we give them less attention. In the centralized approach one processor coordinates Steps 2 and 3.
That is, each processor sends the value of its column proposal to the coordinator. The
coordinator selects the winner, and tells the winning processor to send its column to
all the other processors. In the decentralized approach, each processor broadcasts its
proposal to all the other processors. Simultaneously, all processors can determine the
winner; specifically the winning processor is able to determine that it is the winner. It
can then broadcast its column to all the processors. No coordination is required.
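The decentralized proposal step can be sketched as follows. This is a plain-Python simulation, not the dissertation's MPI implementation; the function names and sample reduced costs are ours.

```python
# Sketch: decentralized column-proposal step.  Each "processor" owns a
# contiguous slice of the n columns, scores its slice with the Dantzig rule
# (most negative reduced cost), and every processor sees every proposal,
# so each one can determine the winner independently -- no coordinator.

def local_proposal(rank, reduced_costs, cols_per_proc):
    """Return (score, global column index) of this processor's best column."""
    start = rank * cols_per_proc
    local = reduced_costs[start:start + cols_per_proc]
    # Dantzig rule: pick the most negative reduced cost (minimization).
    j_local = min(range(len(local)), key=lambda j: local[j])
    return (local[j_local], start + j_local)

def decentralized_winner(proposals):
    """Every processor evaluates the same deterministic min over all
    proposals; ties are broken by lowest column index."""
    return min(proposals, key=lambda t: (t[0], t[1]))

reduced_costs = [3.0, -1.5, 0.2, -4.0, 1.1, -0.3, 2.2, -2.7]
p = 4                                   # hypothetical processor count
cols = len(reduced_costs) // p
proposals = [local_proposal(r, reduced_costs, cols) for r in range(p)]
score, col = decentralized_winner(proposals)
print(col)   # column 3, whose reduced cost -4.0 is the most negative
```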
area network can facilitate this processing. For example, the algorithm requires only
see Section 6.2.). Step 2 using the centralized approach requires p point-to-point
transmissions of essentially one double to the coordinator. The coordinator then sends
a) Each processor selects and sends its local best column to the main
b) The processor with the global max selects the variable (row) to leave the basis
and ships a copy of its winning column together with its row choice to every
processor.
ii) Row i's constraint is the first to be violated. We then pivot using the
violated row.
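The row choice in step ii) is the classical minimum-ratio test. A minimal sketch, with illustrative names rather than retroLP's actual code:

```python
# Minimum-ratio test sketch: as the entering variable increases, basic
# variable i reaches zero after b[i]/a[i] units, where a[i] is the pivot
# column entry.  The first constraint to be violated -- the smallest
# ratio over rows with a[i] > 0 -- determines the leaving row.

def ratio_test(b, pivot_col, eps=1e-9):
    best_row, best_ratio = None, None
    for i, (bi, ai) in enumerate(zip(b, pivot_col)):
        if ai > eps:                      # only rows that bound the increase
            ratio = bi / ai
            if best_ratio is None or ratio < best_ratio:
                best_row, best_ratio = i, ratio
    return best_row                       # None signals an unbounded LP

b = [6.0, 2.0, 6.0]
col = [2.0, 1.0, -1.0]
print(ratio_test(b, col))   # row 1: min(6/2, 2/1) = 2.0 at row 1
```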
model proposed by Culler et al [1993]. Another was the BSP model proposed by
Valiant [1990]. In addition to the previous two we looked at simple models assuming
the use of local area networks (LAN’s) such as Ethernets or Token Rings for
communication. Most LAN’s are intrinsically broadcast devices, however not all
software for using them in distributed computing takes advantage of that capability.
We considered LAN communication both with and without broadcast primitives. The
LogP model is a model for asynchronous computation whereas the BSP is a model for synchronous computation, so we examined our algorithm to see how synchronous it is. We also had to determine the size of the messages sent, since the matter of whether an algorithm uses coarse or fine grain parallelism influences the choice of a communication model. All the models were used in the analysis, although
we modified the LogP model slightly. In our testing, detailed in Section 7, only the
Ethernet with broadcast is used. The following paragraphs explain this in detail.
As just mentioned, four parallel models were used to analyze the program.
These will be listed later in this section in Table 5.3. The first is a model of Ethernets
using a broadcast primitive. The second is a model of Ethernets not using a broadcast
primitive. These were chosen due to the common use of these topologies. The other
two models assume a completely connected topology. They are the BSP model and
another commonly used (CU) model. This commonly used model will be used instead
of the LogP model for two reasons. The LogP model is complex and it overestimates
the running times for programs that send large messages. Our program broadcasts
vectors which can be quite large and which can cause LogP to give bad estimates.
supersteps. Our algorithm has a few supersteps which makes BSP an interesting
model to use. By using both these models, we can see how long it should take both on
take m to be 1,000 and p to be 100; the discussion also applies to all m and p. In the Ethernet model, if 1,000 items are to be sent across the network within one message it takes s+1000g time, where s represents the startup time (latency) and g represents the time per item (the inverse throughput). With a broadcast primitive, sending the same 1,000 items to p processors still takes s+1000g; if broadcasts are not supported then (99)(s+1000g) time is required. We ignore the
effects of collisions and retransmissions, and assume the Ethernet is lightly loaded.
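A small calculator for this model (the s and g values are illustrative, not our measured constants):

```python
# Ethernet cost model: sending m items costs s + m*g; without a broadcast
# primitive the same message must be repeated to each of the p-1 receivers.

def ethernet_bcast(m, p, s, g, has_broadcast):
    return s + m * g if has_broadcast else (p - 1) * (s + m * g)

s, g = 2.0e-3, 1.8e-6          # illustrative startup and per-item times, sec
m, p = 1000, 100
with_bc = ethernet_bcast(m, p, s, g, True)
without_bc = ethernet_bcast(m, p, s, g, False)
print(without_bc / with_bc)    # exactly p-1 = 99 times more expensive
```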
The LogP model has three parameters: l, o and h. The definition of these
parameters is subtler than for those above. The parameter l is the time it takes for an
item to go from processor to processor over the network. Typically this is extremely
short. o is the operating system time taken by a processor to send or receive an item.
To send one item would cost l+2o. h is the time the processor must wait before
sending the next item. In this model, processor A can begin sending to another
processor C, after sending to processor B, before B has completely received its data.
Whether to send 10 items from one processor to another one at a time or to send them as a group matters: sending the items as a group will be faster on most architectures, which is why LogP will overestimate the cost of large messages.
The commonly used model (CU) has 2 parameters: s and g. These are the
same as in the Ethernet model. The difference is that different groups of processors
can communicate to each other simultaneously, whereas in the Ethernet only one
processor can successfully communicate at a time. Unlike the LogP model, in the CU
model a processor can't begin a session until its previous session is finished. To send
The BSP model has 2 parameters: L and g. g is the same g as in the last model.
synchronized period called a superstep one takes the maximum number of items any
broadcast facility. Our program has two basic communication segments that correspond to two primitives, Allreduce and Bcast, from MPI [Snir, 1996]. MPI is a parallel communications package that allows
Section 6 describes MPI as well as the original reason for its choice as the parallel
package before sockets were used. For ease of reference, the two MPI primitive
names will be used for what we implement using sockets. Below is a brief analysis of
the Bcast and Allreduce MPI primitives. For a more in-depth analysis see Karp et al
Allreduce gets one element from each processor to a "root" processor. (This
first step is called Reduce.) This “root” calculates the maximum of these and
broadcasts the maximum to all the other processors. For a completely connected
network topology, Reduce has been shown to take the same time as Bcast since it
involves messages between the same endpoints in the opposite direction [Karp et al,
1993]. Allreduce takes at most 2*Bcast time. This bound is easily achieved by
On an Ethernet, such as our network, Reduce is slower than Bcast because all processors can listen at once, but only one can transmit. Although we assume in our expressions that Reduce takes O(p) time for p processors, note that Martel [1994] found an O(log p) approach. Depending on the algorithm used to implement Reduce, it can take different amounts of time.
Ethernet broadcast primitive is being taken advantage of or not. Allreduce takes (p-
For the BSP the Allreduce takes L+2(p-1)g. (This analysis assumes that the
Reduce and Bcast that are implemented within the Allreduce are done sequentially
“superstep.” We show how to Bcast m items both using one superstep and using two
which is why we won't use a logarithmic tree algorithm such as in the CU model (see
the next paragraph). In the first algorithm the root sends the m items to the p-1 processors. This takes L + (p-1)mg time. In the second algorithm, the root splits the m items into p parts of size m/p. It then sends a different part to each of the p-1 processors in the first superstep. This takes L + (p-1)(m/p)g time. In the second superstep, each processor sends its part to the other p-2 processors (excluding itself and the root). The root sends its portion to the p-1 processors and receives the same. This also takes L + (p-1)(m/p)g time. Adding the two supersteps gives approximately 2L + 2mg.
For the commonly used model (CU), Allreduce takes 2(s+g)(log2 p). This can be seen as follows. A Reduce is the command that gathers information from each processor to one “root”
processor. This is the opposite of a Bcast. To broadcast, a processor sends one item to
two other processors who in turn recursively each send to another two processors.
This is a cost of (s+g)(log2 p). It has been shown that an Allreduce performed using a
processors. This takes (s+mg)(log2 p) time. The second algorithm splits the m items
into p/2 pieces of size 2m/p. A sends piece one to B, B sends it to C and so on. As
soon as C gets that piece (B is ready for more) A sends the next piece to B. The last
piece leaves A after p-1 time and arrives at the last processor after 2p-3 time.
Pipeline: A → B → C → D → … , each link carrying pieces of size 2m/p. This takes (s + (2m/p)g)(2(p - 3)) = 2ps + 4mg - 6s - (12m/p)g time. This second algorithm can also be
extended from a simple pipeline to using a log2p tree, similar to the first algorithm.
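The broadcast costs derived above can be compared numerically. The constants below are illustrative, not our measured values:

```python
# Cost formulas for broadcasting m items to p processors, transcribed from
# the analysis above:
#   BSP alg 1 (one superstep):   L + (p-1)*m*g
#   BSP alg 2 (two supersteps):  ~ 2L + 2*m*g
#   CU tree:                     (s + m*g) * log2(p)
#   CU pipeline:                 (s + (2m/p)*g) * 2*(p-3)

import math

def bsp_alg1(m, p, L, g):  return L + (p - 1) * m * g
def bsp_alg2(m, p, L, g):  return 2 * L + 2 * m * g
def cu_tree(m, p, s, g):   return (s + m * g) * math.log2(p)
def cu_pipe(m, p, s, g):   return (s + 2 * m / p * g) * 2 * (p - 3)

m, p = 1000, 64
s, g, L = 2.0e-3, 1.8e-6, 1.0e-3       # illustrative constants (seconds)
for f in (bsp_alg1, bsp_alg2):
    print(f.__name__, f(m, p, L, g))
for f in (cu_tree, cu_pipe):
    print(f.__name__, f(m, p, s, g))
```

For large p the second BSP algorithm wins, since its cost is essentially independent of p.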
More detailed explanations of the BSP and log2 p models can be found in Goudreau et al.
These MPI primitives are used in the steps described in Section 5.1. Allreduce
is used for step 2 “Column Proposal” and Bcast is used for step 3 “Pivot Column
as opposed to broadcast, which affects all processors on the LAN, only sends the
workstations. There are 7 identical Sun Ultra 5 Workstations (270 Megahertz), each
with 128 MB RAM, all running Solaris 5.7. A single 100-megabit per second
Other workstations that weren’t identical were not used. During testing it was
Experiments for the serial version, retroLP, were also performed on a PC. It is
a Dell 610MT Workstation with 384 MB RAM. It has a Pentium II processor running
integrated L2 cache. The PC environment was used for the results given in Section
7.3.
For the computation part, the times for division, multiplication and comparison were estimated by timing a loop of 15,000 operations. The running time of this loop was then divided by 15,000. All
optimization was turned off in the compilation. In addition, a large array was
allocated. Each element of this array was operated on. This scheme does not allow the
compiler to optimize. It also matches the way operations are performed in a doubly
subscripted array (the tableau). A problem with using the estimates of division, multiplication and comparison is that row choice involves more than divisions, column choice more than comparisons, and the pivot more than multiplications. To
get a closer estimate, the functions for the three different parts of the simplex method
were called for a matrix of size m=1,000 by n=10,000. The three parts are column
choice, row choice and pivot. Note that in practice, many columns are not looked at in
detail within the column choice. It is only necessary to look at the sign of the cost
coefficient and the bound of the non-basic variable. In one empirical test we found that only 35% of the columns were eligible candidates for the basis; the other 65% could be skipped. We return to this point in reference to the steepest edge column choice rule, where this observation makes a substantial difference. The resulting times were then divided by m=1,000, n=10,000
and mn=10 million for row choice, column choice and pivot respectively. This gave a
"unit" time for that part of the algorithm. This unit is in a sense the amortized time of
all the multiplication and division operations, as well as any other part of the
calculation. These units are listed in the last three columns of Table 5.1. The running
time for any other size problem can be estimated by multiplying the unit row choice
time by m, the unit column choice time by n and the unit pivot time by mn. Although
the timing estimates for division, multiplication and comparison given in Table 5.1
are not used in the calculation, they are provided for reference.
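The scaling rule just described can be applied directly; for instance (unit times transcribed from Table 5.1):

```python
# Estimating serial per-iteration time from the unit times of Table 5.1:
# multiply the unit row choice by m, the unit column choice by n, and the
# unit pivot by mn, then sum.

urow, ucol, upiv = 1.65e-6, 3.73e-8, 1.24e-7   # seconds, from Table 5.1

def serial_iteration_time(m, n):
    return m * urow + n * ucol + m * n * upiv

t = serial_iteration_time(1000, 10000)
print(round(t, 3))   # dominated by the mn pivot term: ~1.24 seconds
```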
run the actual program on many different problem sizes using differing numbers of
processors. We then tabulate a list of actual computation times taken by the program
runs. Linear regression is then used to estimate the coefficients of our timing
expressions. This is discussed further in Section 7 where we discuss how well our
packet many times; in our experiments, about 10,000. We sent the packet from A to B
then from B back to A [Dongarra and Donigen, 1996]. This is called Ping-Pong. We
then took the total elapsed time and divided by 20,000 to get the per communication
length 8,000 (64Kbytes). We then took the total elapsed time and divided it by 8,000.
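The recovery of s and g from such measurements can be sketched as a two-point linear solve. The times below are synthetic, not our measured data:

```python
# With a 1-item message the one-way time is dominated by startup s; with an
# 8,000-item message it is dominated by throughput g.  Solving the two
# linear equations t = s + m*g recovers both constants.

def solve_s_g(m1, t1, m2, t2):
    """t1, t2 are one-way times at message sizes m1, m2 (items)."""
    g = (t2 - t1) / (m2 - m1)
    s = t1 - m1 * g
    return s, g

# synthetic one-way times consistent with s = 2e-3 s, g = 1.8e-6 s/item
t_small = 2.0e-3 + 1 * 1.8e-6
t_large = 2.0e-3 + 8000 * 1.8e-6
s, g = solve_s_g(1, t_small, 8000, t_large)
print(s, g)
```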
We ignored startup time since it should be small relative to the transmission of a large message. These measurements were made on the workstations described in Section 5.3, and s and g were calculated for use in the timing expressions. In each case regression yielded better results than the results yielded via direct measurement.
Each row of Table 5.1 corresponds to one estimate; the estimates were repeated a number of times and the average is in the bottom row. Referring to Table 5.1, the unit row choice is 1.65×10⁻⁶ seconds, the unit pivot is 1.24×10⁻⁷ seconds and the unit column choice is 3.73×10⁻⁸ seconds; these units apply for all m, n and p.
upiv, which is the coefficient for the pivot step. The second is ucol_se, which is the
coefficient for the column choice rule when the steepest edge column choice rule is
being used. The final coefficient values upiv and ucol_se, used in the formulas for the
6 different communication models, are listed in Table 5.2. These 6 models are listed
Table 5.1 - unit time estimates (seconds)

s | g | division | multiplication | comparison | unit row choice (urow) | unit pivot (upiv) | unit col choice (ucol) | unit col choice (ucol_se)
1.76E-03 | 1.23E-06 | 1.27E-07 | 1.09E-07 | 3.86E-09 | 1.66E-06 | 1.24E-07 | 3.73E-08 | 3.41E-07
1.74E-03 | 1.82E-06 | 1.35E-07 | 7.41E-08 | 3.80E-09 | 1.65E-06 | 1.25E-07 | 3.74E-08 | 3.43E-07
2.46E-03 | 1.88E-06 | 1.31E-07 | 7.38E-08 | 3.80E-09 | 1.63E-06 | 1.26E-07 | 3.72E-08 | 3.43E-07
2.35E-03 | 1.81E-06 | 1.38E-07 | 7.39E-08 | 3.80E-09 | 1.66E-06 | 1.24E-07 | 3.72E-08 | 3.42E-07
2.05E-03 | 1.85E-06 | 1.31E-07 | 7.77E-08 | 3.80E-09 | 1.71E-06 | 1.24E-07 | 3.71E-08 | 3.41E-07
2.64E-03 | 1.84E-06 | 1.32E-07 | 7.37E-08 | 3.80E-09 | 1.64E-06 | 1.24E-07 | 3.77E-08 | 3.43E-07
1.94E-03 | 1.82E-06 | 1.32E-07 | 7.37E-08 | 3.80E-09 | 1.64E-06 | 1.26E-07 | 3.72E-08 | 3.43E-07
1.88E-03 | 1.84E-06 | 1.33E-07 | 7.37E-08 | 3.80E-09 | 1.63E-06 | 1.24E-07 | 3.71E-08 | 3.42E-07
2.10E-03 | 1.76E-06 | 1.32E-07 | 7.87E-08 | 3.81E-09 | 1.65E-06 | 1.24E-07 | 3.73E-08 | 3.42E-07
upiv 1.24E-07
ucol_se 3.42E-07
activity | ethernet broadcast | ethernet broadcast St. edge | ethernet no broadcast | common model Alg 1 | common model Alg 2 | BSP Alg 1 | BSP Alg 2
comp. get local max | (n/p)ucol | (mn/p)(se_ucol) | (n/p)ucol | (n/p)ucol | (n/p)ucol | (n/p)ucol | (n/p)ucol
comp. winner calcs. pivot row | (m)urow | (m)urow | (m)urow | (m)urow | (m)urow | (m)urow | (m)urow
comm. Bcast column + int | s+(m+1)g | s+(m+1)g | (p-1)(s+(m+1)g) | (s+mg)(log2 p) | 2ps+4mg | L+(p-1)mg | ~2L+2mg
Predicted optimum number of processors and time per iteration in seconds, listed as (p*, sec/iteration) pairs for increasing problem size, m = 1,000:

eth-broad St. edge: (15.5, 0.008733838) (49.1, 0.021736331) (155.1, 0.062869419) (346.9, 0.137229171) (490.6, 0.192948601)
eth-nobroad: (2.5, 0.009358383) (7.7, 0.031334104) (24.5, 0.100926097) (54.7, 0.226745932) (77.3, 0.321026613)
comm. Mod alg1: (4.6, 0.005184726) (45.8, 0.007063647) (457.3, 0.008949164) (2286.5, 0.010267533) (4572.9, 0.010835344)
comm. Mod alg2: (26.4, 0.022746269) (31.3, 0.029126954) (62.2, 0.063022539) (129.4, 0.129994516) (181.4, 0.180583468)
BSP alg1: (13.3, 0.023429024) (15.8, 0.034337139) (31.3, 0.090299238) (65.3, 0.19768193) (91.6, 0.278111702)
BSP alg2: (64.7, 0.011540454) (203.8, 0.014284651) (644.1, 0.018843167) (1440.2, 0.024959694) (2036.7, 0.029115974)
Assuming we use the same column choice rule whether we use the full tableau
method or the revised method, the number of iterations should be the same. Our
analysis therefore focuses on the timing within an iteration. To get the total running
time the number of iterations can then be multiplied by the value calculated. We are
careful to only include the time spent solving the problem; e.g., the time taken to read
Section 5.2 gave a short sketch of one distributed pivot step. Step 1 in Section
5.2 consists of computation within each processor of its local maximum followed by
communication of the maximums to the coordinating processor. The timing for these
actions is given in the first two rows of Table 5.3. Step 2 in Section 5.2 consists of
computation within the “winning” processor of the leaving basic variable (row
choice) and the communication via broadcast of both its pivot column together with
its row choice. The timing for these actions is given in rows 3 and 4 of Table 5.3.
Finally the timing of the pivot within Step 3 of Section 5.2 is given in the last row of
Table 5.3. Note that this analysis assumes that pivot steps will be performed every
iteration.
urow, ucol, upiv, s and g are constants as in the previous section. se_ucol is
the constant for the steepest edge column choice rule. This constant is close to upiv in
value. In rows 1, 3 and 5 each processor goes through its n/p columns, m rows and ((n+1)/p)(m+1) matrix elements respectively. Computation calculations similar to those in rows 1, 3 and 5 of Table 5.3 can be found in Nash and Sofer [1996, pgs. 114-116].
The communication rows (2 and 4) of Table 5.3 were explained in Section 5.2.2.
Each column of the table corresponds to a model. To get the complete running
time of the program for a given model, simply sum that column and multiply by the
number of iterations. This assumes the classical column choice rule was used with a
pivot in each iteration within Step 3 of the steps listed in Section 5.2. The second
column is the only column that assumes the steepest edge rule; the only change from the first column is the column choice entry. There is a tradeoff between communication and computation time: as the number of processors increases, the computation time decreases, but the communication time increases. For each communication model, we
can estimate the optimal number of processors to use by taking the derivative of the
timing function of the algorithm with respect to p. This is then set to 0 and solved for p, using our estimated values of s, g, ucol, urow and upiv with m = 100:

f(p) = (n/p)ucol + 2(p - 1)(s + g) + m·urow + (p - 1)(s + (m + 1)g) + ((n + 1)(m + 1)/p)upiv

df/dp = -[n·ucol + (n + 1)(m + 1)upiv]/p² + 3s + 3g + mg = 0
Table 5.4 gives the predicted optimal number of processors for m = 1,000 and a range of values of n.
The following expressions generalize this example. The expressions drop the unit terms (m instead of m+1). γ, ρ, π and Γ denote ucol, urow, upiv and se_ucol respectively, and p* is the optimum p. Three expressions are given. The first corresponds to the Dantzig rule assuming Ethernet with broadcast as in our example above. The second corresponds to the Steepest Edge rule, again assuming Ethernet with broadcast. The third corresponds to the Dantzig rule, but the communication model is a linear (non-broadcast) one.

Dantzig Rule:
T = γn/p + ρm + πmn/p + s + gm + (s + g)p
p* = sqrt((γn + πmn)/(s + g))

Steepest Edge:
T = Γmn/p + ρm + πmn/p + s + gm + (s + g)p
p* = sqrt(mn(Γ + π)/(s + g))

Linear Broadcast:
T = γn/p + ρm + πmn/p + (s + gm)p + (s + g)p
p* = sqrt((γn + πmn)/(2s + g(m + 1)))
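These closed forms are easy to evaluate. The sketch below uses the Table 5.1 unit estimates and illustrative s and g; the fitted coefficients of Table 5.2 would give somewhat different numbers:

```python
import math

# gamma, rho, pi_ stand for ucol, urow, upiv (Table 5.1 estimates, seconds);
# s and g are illustrative Ethernet constants, not our fitted values.
gamma, rho, pi_ = 3.73e-8, 1.65e-6, 1.24e-7
s, g = 2.0e-3, 1.8e-6

def time_per_iter(m, n, p):
    # Dantzig rule, Ethernet with broadcast (the T expression above)
    return gamma*n/p + rho*m + pi_*m*n/p + s + g*m + (s + g)*p

def p_star(m, n):
    # closed-form optimum obtained by setting dT/dp = 0
    return math.sqrt((gamma*n + pi_*m*n) / (s + g))

m, n = 1000, 5000
p = p_star(m, n)
print(round(p, 1), round(time_per_iter(m, n, p), 4))
```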
5.8 The optimum number of processors with a new column division scheme
Classic rule
The optimum p* derived in the previous section assumed that columns were
equally divided among the processors. After the processors each calculated their best column proposal, communication took a total of ps time: one startup for each processor. If we can stagger the proposals by giving processor i a total of n0 + ik columns, where n0 is the base number of columns every processor holds and k is the number of additional columns each processor gets over the previous one, most of these startups overlap with computation. k should be the value satisfying

s + g = (πm + γ)k  ⇒  k = (s + g)/(πm + γ)
Notice that now, instead of a cost of ps, there is only the cost of the s of the last processor's proposal. This scheme's p* is a factor of √2 larger than the old scheme's p*.
As an example take a problem where m=1,000 and n=5,000. The first scheme calculates:

p* = 53.53
n0 = n/p = 93.40
k = 0
T = .02347 seconds per iteration

The second scheme calculates:

p* = 75.95
n0 = .866
k = 1.73
T = .01761 seconds per iteration
All of the calculations and experiments in this paper used the first scheme.
The last scheme is mentioned in the future work section (Section 8).
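The staircase division can be sketched numerically. This is our reconstruction with illustrative constants; the dissertation's fitted values would give a different k and n0:

```python
# Staircase allocation: processor i (i = 0..p-1) receives n0 + i*k columns,
# so later processors finish scanning later and the proposal startups
# arrive staggered.  n0 follows from requiring the counts to sum to n.

s, g = 2.0e-3, 1.8e-6                  # illustrative constants (seconds)
pi_, gamma = 1.24e-7, 3.73e-8          # upiv, ucol from Table 5.1
m, n, p = 1000, 5000, 20

k = (s + g) / (pi_ * m + gamma)        # extra columns per successive processor
n0 = n / p - k * (p - 1) / 2           # base count so the counts sum to n

counts = [n0 + i * k for i in range(p)]
print(round(k, 2), round(n0, 2), round(sum(counts), 6))   # total recovers n
```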
A similar analysis is provided for the steepest edge column choice rule:

s + g = (πm + Γm)k  ⇒  k = (s + g)/(πm + Γm)

This scheme's p* is also a factor of √2 larger than the old scheme's p*.
Figure 5.1 shows the time per iteration as n increases; both schemes are plotted, under both the standard and steepest edge rules.

[Figure 5.1: time per iteration vs. n — curves: Par. Bcast, Bcast scheme 2, Par. Bcast S.E., Bcast S.E. scheme 2]
Table 5.5 consists of the estimated timing for all the communication methods.
We use the results of Table 5.4 as the number of processors (p) to plug into the
expressions in Table 5.3. In Table 5.5, m is set at 1,000. It is now easy to estimate the
5.10 Running time estimates of the revised (MINOS), serial (retroLP) and parallel (dpLP) methods
Table 5.6 is based on the models of Sections 3.4 and 5.5-5.8 and uses the
coefficients given in Table 5.2. Table 5.6 compares the estimated running time per
iteration of three algorithms for problems of varying size. The three algorithms we
compared are our serial full-tableau simplex method, the revised method and our
parallel full-tableau simplex method. The parallel simplex in the table assumes an
Ethernet with broadcast and the optimum number of processors. Two sets of optimal times are shown: one for the scheme of equally dividing up the columns amongst the processors and one for the scheme proposed in Section 5.8. Both the serial time and the parallel time are also shown when the steepest edge
column choice rule is used. Times per iteration for the revised method were
estimated, assuming both a dense (100%) tableau and a sparse (5%) tableau. Both
densities were not shown for the full tableau method because density has very little
effect on the running time of the full-tableau algorithms whereas the revised method
runs more quickly on sparse problems. The last two columns of Table 5.6 show the
optimum number of processors to use when employing the standard and steepest edge
sensitive to this approximation (see Section 5.14). Note that the revised method, for
completely dense problems, is slower than the full tableau for all n in the table (aside
from the last line). As n rises, the revised method catches up at a very slow rate; it catches up when n equals 2m² + m, as calculated in Section 3.3. This
m | n | serial | serial S.E. | Par. Bcast | Bcast scheme 2 | Par. Bcast S.E. | Bcast S.E. sch 2 | revised dense | revised sparse | p*-standard | p*-St.Edge
5,000 4,500 2.80 10.50 0.0601 0.0468 0.1038 0.0776 12.13 3.56 120.13 232.68
5,000 5,000 3.12 11.67 0.0627 0.0486 0.1087 0.0811 12.44 3.58 126.63 245.26
5,000 10,000 6.22 23.33 0.0830 0.0629 0.1481 0.1089 15.55 3.73 179.07 346.85
5,000 25,000 15.55 58.32 0.1233 0.0915 0.2262 0.1642 24.87 4.20 283.13 548.41
5,000 50,000 31.09 116.64 0.1688 0.1236 0.3143 0.2265 40.41 4.97 400.40 775.56
5,000 75,000 46.63 174.95 0.2037 0.1483 0.3819 0.2743 55.95 5.75 490.39 949.87
5,000 100,000 62.18 233.27 0.2331 0.1691 0.4389 0.3146 71.49 6.53 566.25 1,096.81
10,000 9,000 11.20 42.00 0.1203 0.0933 0.2076 0.1550 48.50 14.24 240.24 465.34
10,000 10,000 12.45 46.66 0.1253 0.0968 0.2173 0.1619 49.74 14.30 253.23 490.51
10,000 20,000 24.88 93.31 0.1660 0.1256 0.2961 0.2176 62.17 14.92 358.12 693.68
10,000 50,000 62.18 233.27 0.2467 0.1826 0.4524 0.3281 99.47 16.79 566.22 1,096.80
10,000 100,000 124.34 466.52 0.3376 0.2470 0.6286 0.4527 161.63 19.89 800.76 1,551.10
10,000 150,000 186.51 699.77 0.4074 0.2963 0.7638 0.5483 223.79 23.00 980.72 1,899.71
10,000 200,000 248.67 933.02 0.4663 0.3379 0.8778 0.6289 285.94 26.11 1,132.44 2,193.59
10,000 500,000 621.66 2,332.54 0.7215 0.5184 1.3721 0.9785 658.89 44.76 1,790.54 3,468.37
10,000 1,000,000 1,243.31 4,665.07 1.0091 0.7217 1.9293 1.3724 1,280.48 75.84 2,532.21 4,905.02
10,000 300,010,000 373,000.75 1,399,562.96 17.0359 12.0538 32.9740 23.3240 373,000.75 18,661.86 43,859.85 84,958.82
Table 5.6 - estimated running time per iteration (seconds)
[Figure 5.2: estimated time per iteration vs. n for m = 5,000 — curves: serial, revised dense, revised sparse]
In Figure 5.2, the 3 algorithms are compared. The figure corresponds to the
top part of Table 5.6 where m is 5,000. The x-axis is n and the y-axis is the estimated
time per iteration in seconds. This comparison is for dense problems. The revised is
even slower than the standard simplex. This is due to the extra computation needed to
calculate the objective row and the pivot column. As just mentioned, the figure shows
the revised method slowly catching up to the full tableau method as n increases. It takes such a large number of columns to catch up that it seems reasonable to say that for all practical dense problems the revised method is slower than the standard method.
Moving over to the Ethernet based parallel algorithm one can see that the added time
for a higher n is minimal. It is only the cost of sending a larger vector over the
Ethernet, which is a very small cost for an Ethernet with a broadcast facility.
There are four variables to deal with when comparing the different methods.
They are Aspect Ratio (AR=n/m), size (m), density (d), and number of processors (p).
Figures 5.3, 5.4 and 5.5, compare the serial full tableau method using the
standard column choice rule, the serial full tableau method using the steepest edge
column choice rule and the revised method. They compare them as density, aspect
ratio and m are varied, respectively. The base problem is m=100, p=1 (serial method),
Figure 5.6 is the fourth graph of the group. It shows a comparison between the
parallel full tableau method using the standard column choice rule, the parallel full
tableau method using the steepest edge column choice rule and the revised method.
The parallel method uses an Ethernet with broadcast. We also include the Ethernet without a broadcast primitive for comparison.
Figure 5.7 is the same as Figure 5.6 except that it is for a larger problem
where m=10,000 and n=100,000. We graphed this in order to show a problem for which the parallel method using a small number of processors would actually overtake the revised method.
In both Figure 5.6 and Figure 5.7 the point on each curve with the lowest time per iteration is where the optimum p (p*) is being used. The p* values for the two Ethernet models in Figure 5.7 are:

eth-broad: 75.7
eth-nobroad: 20.1

Notice that these numbers can be rounded to whole integers. From the graphs one can see that the time per iteration for Ethernet using broadcast is not very sensitive to the exact number of processors near the optimum.
[Figure 5.3: time per iteration vs. density (0%-100%) — curves: revised, serial, serial S.E.]

[Figure 5.4: time per iteration vs. aspect ratio — curves: revised, serial, serial S.E.]
[Figure 5.5: time per iteration vs. number of rows, m — curves: revised, serial, serial S.E.]
[Figure 5.6: time per iteration vs. number of processors, p]
[Figure 5.7: time per iteration vs. number of processors, p, for the larger problem]
For sparse problems the revised is much quicker than the serial full tableau
method. Our distributed method when used with the optimal number of processors on
an Ethernet with broadcast is even faster than the revised method. We thus have two
important parameters that determine whether the revised method or our full tableau
(retroLP and dpLP) algorithms are faster. One parameter is density and the other is
the number of processors that we have. If we have a completely dense problem our
method is faster. If we have the optimum number of processors our method is often
faster even on sparse problems. This can be seen in Figures 5.6 and 5.7. The figures
refer to problems with only 5% density. The parallel algorithm on an Ethernet with
broadcast comes extremely close to the time of the revised, for a problem size of 100
x 1,000 (figure 5.6) for its optimum 7.6 processors. For a problem size of 1,000 X
10,000 the parallel method would overtake the revised if it had 7 processors, much
lower than its optimum. This is shown in Figure 5.7. The optimum p for that example
is about 76 and 144 processors for the standard and steepest edge rules respectively.
Even in our first example a slightly higher density would already cause the revised to be slower. In practice our method can therefore be used in two cases. The first is dense problems; applications are given in Section 7. The second case occurs whenever we have access to enough processors.
These comparisons assume all methods are using the classical Dantzig column
choice rule. The tableau method and especially the distributed method can make very
effective use of alternative column choice rules to further improve the relative
performance. One of these rules is the steepest edge rule. As mentioned in Section
4.1.2 it is not necessary to re-compute any columns in order to perform this rule. It
simply requires, at most, an extra mn multiplications within the column choice step.
The steepest edge rule is included in these tables and in Figures 5.6 and 5.7. Notice
how expensive it is to use with only one processor and how quickly it speeds up as
the processor number increases. Here we compare only time per iteration; the gain of the alternate column choice rules is in the smaller number of iterations that are necessary. This extra computation is more than offset by the reduction in the number
of iterations [Harris 1973; Goldfarb and Reid 1977; Forrest and Goldfarb, 1992]. See
Section 7.4 for experimental results supporting this view. This extra computation is
shared amongst the processors just like the rest of the computation. These rules can
be faster than the classical rule even without the use of multiple processors. This
Note that the expressions and all the graphs assume that every column is
looked at using the steepest edge column choice rule. This in fact is not true. In one
empirical test we found, on average, that only 35% of the columns were eligible and
therefore looked at. All the charts and graphs assume all columns were eligible; this is an upper bound, so the steepest edge does even better in practice than shown.
An expression for time per iteration using the steepest edge rule was provided
in Section 5.8. As the number of processors increases, the iteration time of the
steepest edge rule becomes more competitive with the iteration time of the standard
rule. It cannot actually catch up to the speed of the standard rule per iteration. We
would like to calculate the percentage fewer iterations needed to make the steepest
edge overtake the standard column choice rule. The expression for this is simply time
per iteration of the standard rule divided by time per iteration of the steepest edge.
For a problem of size 1,000 by 10,000 (Figure 5.7) the percentage when p=1 is 27.89%. This means that the steepest edge needs to take no more than 27.89% of the iterations performed by the standard method in order to catch up in total time. The higher the
percentage, the better it is for the steepest edge method. For 76 processors (the
optimum for the standard method) the percentage is 46.04%. For 144 processors (the
optimum for the steepest edge) the percentage is 65.92%. Section 7.3 shows how
many iterations were actually performed by the steepest edge method in practice.
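The break-even fraction is a direct ratio of the two per-iteration times. The numbers in the sketch below are illustrative assumptions, not the dissertation's measurements:

```python
def breakeven_fraction(t_standard, t_steepest):
    """Fraction of the standard rule's iteration count at which the
    steepest edge rule breaks even: (time/iter standard) / (time/iter SE)."""
    return t_standard / t_steepest

# For example, if the standard rule took 0.020 s/iteration and steepest
# edge 0.0717 s/iteration (hypothetical values), steepest edge would have
# to finish in under about 27.9% of the standard rule's iterations.
example = breakeven_fraction(0.020, 0.0717)
```

As more processors are used, the steepest edge iteration time falls faster, so the fraction (and hence the tolerance for extra iterations) rises, matching the 27.89%, 46.04%, and 65.92% figures quoted above.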
The full tableau holds a data matrix of m+1 by n+1 double precision floating-
point numbers. It also has a few auxiliary vectors, which are not included in this
calculation. (6 vectors of size m and 6 vectors of size n.) A double precision element
is 8 bytes. That amounts to 8(m+1)(n+1) bytes. A problem with table size 100x100 (m=100, n=100) takes 81,608 bytes and a problem of size 1,000 x 10,000 takes 80,088,008 bytes. This is the memory required when using only one processor. The revised method
requires (m+1)(n+1) elements for the original data. In addition it requires a (m+1) by
(m+1) matrix for the inverse of the basis assuming an explicit basis inverse is
maintained. It also has an extra vector of size m+1 for the pivot column, which is not
8[(m + 1)(n + 1) + (m + 1)(m + 1)] bytes. A problem with table size 100x100 takes
163,216 bytes and a problem size of 1,000 X 10,000 takes 88,104,116 bytes. Figure
5.8 is a graph of memory requirements in megabytes for both the tableau method and
the revised method as m gets larger, assuming n=10m. Notice that for dense problems with this aspect ratio, the revised method takes approximately 10% more memory than the full tableau method:
$$\frac{(m+1)(n+1) + (m+1)(m+1)}{(m+1)(n+1)} = 1 + \frac{m+1}{n+1} = 1 + \frac{1+\frac{1}{m}}{\frac{n}{m}+\frac{1}{m}} = 1 + \frac{1+\frac{1}{m}}{10+\frac{1}{m}} \approx 1.1$$
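These byte counts can be reproduced directly. The short sketch below assumes 8-byte doubles and, as in the text, ignores the auxiliary vectors:

```python
def tableau_bytes(m, n):
    """Full tableau: an (m+1) x (n+1) matrix of 8-byte doubles."""
    return 8 * (m + 1) * (n + 1)

def revised_bytes(m, n):
    """Revised method with an explicit inverse: the original data plus an
    (m+1) x (m+1) basis inverse, both in 8-byte doubles."""
    return 8 * ((m + 1) * (n + 1) + (m + 1) * (m + 1))

print(tableau_bytes(100, 100))    # 81608, as in the text
print(revised_bytes(100, 100))    # 163216
print(tableau_bytes(1000, 10000)) # 80088008
print(revised_bytes(1000, 10000)) # 88104016
```

Note that for m = n the revised method needs exactly twice the tableau's memory, while at the 1:10 aspect ratio the overhead is only about 10%.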
This assumed a direct inverse representation of the basis for the revised
simplex method. Functional equivalents would take a similar amount of storage in the
dense case. For sparse problems the revised method saves a lot of space by using sparse factorizations of the basis.
Figure 5.8 – Memory in megabytes vs. m for the full tableau and revised methods, n = 10m
The coefficients used here were based on the network described in Section
5.3. It is interesting to see how it would fare on more current networks and on
networks in the future where the computation to communication time ratios may
change.
The communication time g per element transmitted is 1.7*10^-6 for one double (for a byte it would be .425*10^-6). The computation operations for a double floating point number are of order 10^-7. These parameters include unit column choice for steepest edge, unit pivot, multiplication and division. For simplicity, all computational operations in this analysis will be assumed to take 10^-7 even though there is some variation among them.
Figure 5.9 and Figure 5.10 show the asymptotic change in the speedup of our
parallel program, when exactly 7 processors are used, as the ratio of the computation
time to communication time changes. The leftmost points in both figures show the
current speedup assuming the current speeds of s, g and computation. Each point
along the horizontal axis assumes that the communication or computation speed, for
Figures 5.9 and 5.10 respectively, doubles from the speed of the point to its left.
Figure 5.9 shows what happens when both s and g get faster and the computation
speed is held constant. On each subsequent point the speed of both s and g are
doubled. Figure 5.10, on the other hand, shows what happens when the computation
speed gets faster and both s and g are held constant. On each subsequent point the computation speed is doubled. Both figures assume the program runs on an Ethernet with broadcast and that the optimum p, p*, is used in dpLP.
From the figures we can see that the speedup is affected very much by these ratios. As communication gets faster we can take advantage of more processors since the communication costs are lower. On the other hand, as computation gets faster the relative cost of communication increases. This causes a decrease in speedup since we can use fewer processors.
Figure 5.9 – Asymptotic speedup when a unit computation = 10^-7; s and g move together (x-axis: factor by which s and g are sped up, from 1 to 128)
Figure 5.10 – Asymptotic speedup when s = 3.4*10^-7 and g = 1.7*10^-6; computation speed is changing (x-axis: computation time/communication time ratio, scaled)
We next consider the sensitivity of the running time to small changes in the parameters. This is important because, before the program is run, our timing expressions tell us how many processors, p*, to use; we then know the running time to expect.
p*, if only changed slightly, does not significantly affect the overall running time. This suggests that we can round p* to the nearest whole number without paying a significant penalty. It may also be difficult to use the optimum p* when it is a large number. We would like to know how much the timing would be affected if we use a few fewer processors than the optimum.
Assume that we round the optimal number of processors, p*, to its nearest integer pint*, and we then run the problem with pint* processors rather than with p*. The running time with pint* is Tint; with p* processors it would have been T.
$$T = T(p^*), \qquad T_{int} = T(p_{int}^*)$$

$$T = \frac{\gamma n}{p} + \rho m + \frac{\pi mn}{p} + s + gm + (s+g)p$$

$$p^* = \sqrt{\frac{\gamma n + \pi mn}{s+g}} \;\Rightarrow\; \gamma n + \pi mn = p^{*2}(s+g) \qquad (1.1)$$

$$\begin{aligned}
T(p^*) &= \frac{\gamma n + \pi mn}{p^*} + (\rho+g)m + (s+g)\,p^* + s \\
&= \sqrt{(\gamma n + \pi mn)(s+g)} + (\rho+g)m + (s+g)\sqrt{\frac{\gamma n + \pi mn}{s+g}} + s \\
&= 2\sqrt{(\gamma n + \pi mn)(s+g)} + (\rho+g)m + s \\
&= 2\sqrt{p^{*2}(s+g)}\,\sqrt{s+g} + (\rho+g)m + s \qquad \text{from (1.1)} \\
&= 2p^*(s+g) + (\rho+g)m + s \\
&\ge 2p^*(s+g)
\end{aligned}$$

$$T_{int} = \frac{\gamma n + \pi mn}{p_{int}} + (\rho+g)m + (s+g)\,p_{int} + s$$

$$\begin{aligned}
T_{int} - T(p^*) &= \frac{\gamma n + \pi mn}{p_{int}} + (s+g)\,p_{int} - 2p^*(s+g) \\
&= \frac{p^{*2}(s+g)}{p_{int}} + (s+g)\,p_{int} - 2p^*(s+g) \qquad \text{from (1.1)} \\
&= (s+g)\left(\frac{p^{*2}}{p_{int}} + p_{int} - 2p^*\right) \\
&= (s+g)\,\frac{p^{*2} + p_{int}^2 - 2p^*p_{int}}{p_{int}} \\
&= (s+g)\,\frac{(p^* - p_{int})^2}{p_{int}} \qquad \text{perfect square} \\
&< (s+g)\,\frac{(.5)^2}{p_{int}} \qquad p^* \text{ rounded to the nearest integer} \\
&= \frac{1}{4}\,\frac{s+g}{p_{int}}
\end{aligned}$$

$$\frac{T_{int} - T(p^*)}{T(p^*)} < \frac{\tfrac{1}{4}(s+g)/p_{int}}{2p^*(s+g)} = \frac{1}{8\,p_{int}\,p^*} \qquad \text{bound}$$
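This bound can be checked numerically. The parameter values in the sketch below are arbitrary assumptions, not the dissertation's measured coefficients; the bound holds for any values once p* is rounded to the nearest integer:

```python
import math

def T(p, work, overhead, s, g):
    """Running-time model T(p) = work/p + overhead + s + (s+g)p,
    where work stands for gamma*n + pi*m*n and overhead for (rho+g)*m."""
    return work / p + overhead + s + (s + g) * p

# illustrative (assumed) parameter values
work, overhead = 2.0e-2, 1.0e-3
s, g = 3.4e-4, 1.7e-6

p_star = math.sqrt(work / (s + g))  # optimal real-valued p*
p_int = round(p_star)               # the processor count actually used

gap = T(p_int, work, overhead, s, g) - T(p_star, work, overhead, s, g)
rel = gap / T(p_star, work, overhead, s, g)
assert 0 <= rel < 1.0 / (8 * p_int * p_star)  # the bound derived above
```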
A change in p* does not substantially affect T.
This can also be seen approximately from the second derivative in the Taylor series (note that $T'(p^*) = 0$ at the optimum):

$$\frac{dT}{dp} = \frac{-(\gamma n + \pi mn)}{p^2} + s + g$$

$$T'' = \frac{2(\gamma n + \pi mn)}{p^3}, \qquad T'' = \frac{2(s+g)}{p^*} \;\text{ at } p^*$$

$$T_{int} - T(p^*) = T'(p^*)(p_{int} - p^*) + \frac{1}{2}T''(p_\phi)(p_{int} - p^*)^2$$

$$|T_{int} - T(p^*)| \le \frac{1}{2}\cdot\frac{1}{4}\,T''(p_\phi) \qquad \text{since } |p_{int} - p^*| < .5$$

$$\le \frac{1}{4}\,\frac{\gamma n + \pi mn}{p_\phi^3} = \frac{1}{4}\,\frac{p^{*2}(s+g)}{p_\phi^3} \equiv O\!\left(\frac{10^{-4}}{p^*}\right)$$
An iteration on a 1,000 by 5,000 size problem with 10 processors takes about .1 seconds. p* is at least 1 and is usually over 10 for even relatively small problems. Rounding p* to the nearest integer therefore has a negligible effect on the running time.
Assume we think that the startup time is serr; we then calculate perr* based on serr and run the problem with perr* processors. In fact the startup time is s (not serr). The running time of the program with the wrong perr* is Terr; we should have used p* processors, which would have given a running time of T.
$$T = T(p^*(s), s), \qquad T_{err} = T(p_{err}^*(s_{err}), s)$$

$$T = 2p^*(s+g) + (\rho+g)m + s \qquad \text{from (1.1)}$$

$$T_{err} = \frac{\gamma n + \pi mn}{p_{err}} + (\rho+g)m + (s+g)\,p_{err} + s$$

$$T_{err} - T = \frac{\gamma n + \pi mn}{p_{err}} + (s+g)\,p_{err} - 2p^*(s+g)$$

$$p_{err} = \sqrt{\frac{\gamma n + \pi mn}{s_{err}+g}} \;\Rightarrow$$

$$\begin{aligned}
T_{err} - T &= \frac{\gamma n + \pi mn}{\sqrt{\dfrac{\gamma n + \pi mn}{s_{err}+g}}} + (s+g)\sqrt{\frac{\gamma n + \pi mn}{s_{err}+g}} - 2\sqrt{(\gamma n + \pi mn)(s+g)} \\
&= \sqrt{\gamma n + \pi mn}\left(\sqrt{s_{err}+g} + \frac{s+g}{\sqrt{s_{err}+g}} - 2\sqrt{s+g}\right) \\
&= \sqrt{\gamma n + \pi mn}\;\frac{(s_{err}+g) + (s+g) - 2\sqrt{s+g}\,\sqrt{s_{err}+g}}{\sqrt{s_{err}+g}} \\
&= \frac{\sqrt{\gamma n + \pi mn}}{\sqrt{s_{err}+g}}\left(\sqrt{s+g} - \sqrt{s_{err}+g}\right)^2
\end{aligned}$$
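The closed form can be verified numerically against the direct definition of Terr − T (the (ρ+g)m + s terms cancel in the difference). The parameter values below are arbitrary assumptions:

```python
import math

work = 2.0e-2          # stands for gamma*n + pi*m*n (assumed value)
s, g = 3.4e-4, 1.7e-6  # true startup and per-element times (assumed)
s_err = 5.0e-4         # mistaken startup time

p_star = math.sqrt(work / (s + g))
p_err = math.sqrt(work / (s_err + g))

def run_time(p):
    # terms that cancel in the difference are omitted
    return work / p + (s + g) * p

direct = run_time(p_err) - run_time(p_star)
closed = (math.sqrt(work) / math.sqrt(s_err + g)
          * (math.sqrt(s + g) - math.sqrt(s_err + g)) ** 2)
assert math.isclose(direct, closed)
```

The closed form makes clear that the penalty is always non-negative and grows with the square of the gap between the square roots of the two startup estimates.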
$$T = T(p^*(\pi), \pi), \qquad T_{err} = T(p_{err}^*(\pi_{err}), \pi)$$

$$p^* = \sqrt{\frac{\gamma n + \pi mn}{s+g}} \;\Rightarrow\; \gamma n + \pi mn = p^{*2}(s+g) \qquad (1.1)$$

$$p_{err}^* = \sqrt{\frac{\gamma n + \pi_{err} mn}{s+g}} \;\Rightarrow\; \gamma n + \pi_{err} mn = p_{err}^{*2}(s+g) \qquad (1.2)$$

$$p - p_{err} = \frac{\sqrt{\gamma n + \pi mn} - \sqrt{\gamma n + \pi_{err} mn}}{\sqrt{s+g}} \;\Rightarrow$$

$$p^2(s+g) - p_{err}^2(s+g) = \pi mn - \pi_{err} mn \;\Rightarrow$$

$$2(s+g)(p - p_{err})(p + p_{err}) = 2mn(\pi - \pi_{err}) \;\Rightarrow$$

$$2(s+g)(p - p_{err}) = \frac{2mn(\pi - \pi_{err})}{p + p_{err}} \qquad (1.3)$$

$$T = \frac{\gamma n + \pi mn}{p} + (\rho+g)m + (s+g)p + s$$

$$T_{err} = \frac{\gamma n + \pi_{err} mn}{p_{err}} + (\rho+g)m + (s+g)\,p_{err} + s$$

$$\begin{aligned}
T - T_{err} &= \gamma n\left(\frac{1}{p} - \frac{1}{p_{err}}\right) + \frac{\pi mn}{p} - \frac{\pi_{err} mn}{p_{err}} + (s+g)(p - p_{err}) \\
&= (s+g)(p - p_{err} + p - p_{err}) \qquad \text{using (1.1) and (1.2)} \\
&= 2(p - p_{err})(s+g) \\
&= \frac{2mn(\pi - \pi_{err})}{p + p_{err}} \qquad \text{from (1.3)}
\end{aligned}$$
$$\frac{\partial^2 T}{\partial \pi^2} = -\frac{mn}{p^2}\,\frac{\partial p}{\partial \pi} = -\frac{mn}{p^2}\cdot\frac{mn}{2p(s+g)} \qquad \text{from (1.6)}$$

$$= \frac{-(mn)^2}{2p^3(s+g)}$$

$$= \frac{-(mn)^2}{2p(\gamma n + \pi mn)} \qquad \text{from (1.1) and (1.5)}$$

$$T - T_{err} = \frac{\partial T}{\partial \pi}(\pi_{err})(\pi - \pi_{err}) + \frac{1}{2}\frac{\partial^2 T}{\partial \pi^2}(\pi_\phi)(\pi - \pi_{err})^2$$

$$= \frac{1}{2}\left(\frac{mn}{p_{err}(s+g)}\right)(\pi - \pi_{err}) \qquad \text{from (1.6) and (1.7)}$$

$$p - p_{err} < \frac{\partial p}{\partial \pi}(\pi_{err})(\pi - \pi_{err}) + \frac{1}{2}\frac{\partial^2 p}{\partial \pi^2}(\pi_\phi)(\pi - \pi_{err})^2$$
The following tables and graphs show the sensitivity of both p* and the time per iteration to changes in s and to changes in π.
Table 5.7 shows what happens as the error in startup time (s) increases. The
correct s value is in the middle of the table in italics. It has a 0% error. Both p* and
time per iteration are shown for each error in the last two columns. Figure 5.11 and
Figure 5.12 graph p* and time per iteration, respectively, for the percentage errors in
s. The correct s value is in the center of the horizontal axis at 0% error. As you move
to the right the error assigns s too high a speed. As you move to the left the error
assigns s too low a speed. Note that the time per iteration increases in whichever direction the error goes.
Table 5.8 shows what happens as the error in pivot time (π) increases. Figures 5.13 and 5.14 correspond to Table 5.8. The analysis of the last paragraph for s applies here as well. Note that a 30% change in π gives a 10% error in time per iteration, and a 40% change gives a still larger error.
Table 5.7 – p* and time per iteration T for each percentage error in s; the correct s value, at 0% error, is in the middle of the table in italics (surviving row: 2.694E-04, -40%, 67.75, 0.0343, 0.0360)

Figure 5.11 – p* vs. percentage error in s

Figure 5.12 – Time per iteration vs. percentage error in s
Table 5.8 – p* and time per iteration T for each percentage error in π (surviving rows: 1.740E-07, -40%, 189.52, 0.0737; 1.616E-07, -30%, 182.62, 0.0734)

Figure 5.13 – p* vs. percentage error in π

Figure 5.14 – Time per iteration vs. percentage error in π
6. Implementation Choices
Several software packages are available for performing parallel operations. The packages we considered can be used with most programming languages. The candidates included PVM, MPI, and BSP [Goudreau et al, 1999]. The package to be chosen had to be able to run on a network of workstations; another concern was ease of use and portability. Geist [1996] points strongly to PVM.
He claims that MPI has a rich set of functions for point-to-point and collective
communication, but it does not run well on heterogeneous networks. PVM, on the
other hand, is built with the "virtual machine" concept in mind. It should work in a
heterogeneous environment and could handle dynamic process creation. On the other hand, PVM has greater overhead and will under-perform MPI on an MPP, and even more so when many small messages are sent and the overhead is multiplied.
BSP also has coded implementations available. Our application might work with this because it can be synchronized at certain points; that is a main feature of BSP. We nevertheless ruled out BSP.
6.2 Sockets
The distributed programming software just mentioned in fact makes use of socket function calls. There are two categories of sockets. One of the categories, TCP sockets, has built-in error checking. It employs a three-way handshake and ensures that packets are received in order. This is the category that is used by the distributed programming software. TCP sockets cannot take advantage of the Ethernet's broadcast facility. The other socket category is known as UDP. This category is not used by the packages but does allow the Ethernet's broadcast facility to be used. More information on sockets can be found in [Comer and Stevens, 1996].
Our application does not dynamically allocate processes, and it does send many small messages in the column selection process. We assume it will be run on a homogeneous UNIX network. This suggests MPI over the other distributed parallel packages. MPI is also one of the standard packages used and was available to us. If our network had processors with different speeds, PVM would be a little better, although with either package load balancing would have to be handled by our program.
MPI itself is a standard; implementations provide the functions the standard says they must do. One implementation is called LAM (Local Area Multi-computer), a library callable from a standard programming language. It includes communication functions that the processors on the network can use to communicate. Two of the functions we use are Allreduce and Bcast. These were explained in Section 5.3.1 in the context of our method's communication needs.
UDP sockets, on the other hand, allow us to use the Ethernet’s broadcast
capability. This makes a major difference in the scalability of our program. Figures
5.6 and 5.7 show the difference in performance. UDP sockets can safely be used on
local Ethernets where the hardware should deliver the packets in order. The error
checking that is left out in UDP is not necessary on a local Ethernet [Comer and
Stevens, 1996 pg. 13]. MPI is still useful on the larger networks where UDP sockets
cannot be used. We use sockets for empirical testing since they can take advantage of
the Ethernet broadcast, even though both the sockets and the MPI versions were
implemented.
For this reason we decided to directly use socket functions in place of the two MPI communication functions.
Even though we used sockets, the MPI terminology is still useful. The simplex
method consists, primarily, of one loop. Section 2.1 described the serial simplex
tableau method and Sections 5.1-5.2 gave a sketch of the steps for the parallel tableau
method. It was pointed out that within that one loop there are two communications. In the first, an all-reduce gets the maximum bid from each of the processors over its columns and broadcasts the maximum of these, and the identity of the "winning" processor, to all the processors. The winning processor then chooses the pivot row (row choice). In the second communication, the winning processor broadcasts the pivot column to all the processors.
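The two communications per iteration can be sketched as follows. This is a plain-Python simulation of the pattern (an all-reduce with a max-location operation, then a broadcast), not the actual MPI or socket code; the data layout is an illustrative assumption:

```python
def iteration_communications(bids, columns):
    """bids[r]    = best bid among processor r's columns.
    columns[r]    = the pivot column processor r would contribute if it wins.
    Returns what every processor knows after the two communications."""
    # Communication 1: all-reduce with max-location -- every processor
    # learns the globally best bid and which processor owns it.
    winner = max(range(len(bids)), key=lambda r: bids[r])
    best_bid = bids[winner]
    # (row choice happens on the winning processor alone)
    # Communication 2: the winner broadcasts its pivot column to all.
    pivot_column = columns[winner]
    return best_bid, winner, pivot_column

best, who, col = iteration_communications(
    bids=[0.3, 0.9, 0.1],
    columns=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

In the real program the first step maps onto Allreduce (or a UDP broadcast exchange) and the second onto Bcast from the winner.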
MINOS is an optimization package developed at Stanford University [Murtagh and Saunders, 1983-1998]. It takes as input linear and
nonlinear programs in the MPS format. Our experiments, detailed in Section 7, relied
on comparing our method with the revised method; we used MINOS as our
representative of the revised method. It is important to note that we are comparing our
algorithm with the revised method in general. It is difficult to directly compare it with
MINOS when implementations of the revised method vary widely based on heuristic
differences. This is explained in more detail in Section 7.2.3. MINOS is often used
for research; for one reason, its source code is available. In our experiments we used
version 5.5.
Below we describe the settings we used to compare MINOS to our program. More comprehensive details of how to use MINOS can be
found in the MINOS user's guide [Murtagh and Saunders, 1998]. MINOS takes two
files as input: An MPS data file and a specification file. The specification file tells
MINOS the features and parameter values to use when solving the problem. MINOS
defaults to using a crash procedure to get an initial basis [MINOS User’s Guide
Chapter 3]. Our code does not. In order to make comparisons more direct we disabled
it in MINOS, using "CRASH OPTION 0" in the spec file. MINOS will now simply
choose all the slack variables as the basis. We also set "SCALE OPTION 0" so the
problem would not be scaled. This is important because our program and MINOS
have different scaling methods. Below is the MINOS specification file that we used.
BEGIN general
PARTIAL PRICE 1
SCALE OPTION 0
CRASH OPTION 0
MPS FILE 10
PRINT LEVEL 1
PRINT FREQUENCY 1
SUMMARY FREQUENCY 1
END general
MINOS uses partial pricing. As noted in Section 3.2, the revised method benefits from a large n to m ratio. Another way the revised method can reduce computations is by avoiding the pricing out of every column. Instead of getting the best value from every column, one can simply choose from a subset of the columns. This is called partial pricing, as opposed to full pricing. By default MINOS will use partial pricing when n is much larger than m [Murtagh and Saunders, 1998 Ch. 3]. In order to make comparison more direct we disabled partial pricing with the line "PARTIAL PRICE 1" so that all columns are searched for the entering column.
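The idea can be sketched as follows: instead of scanning all n columns for the best reduced cost, scan one segment at a time in rotation and stop at the first segment containing an eligible column. The segment scheme below is an illustrative assumption; MINOS's actual pricing heuristic is more involved:

```python
def full_pricing(reduced_costs):
    """Scan every column; return the index of the most negative reduced cost."""
    j = min(range(len(reduced_costs)), key=lambda k: reduced_costs[k])
    return j if reduced_costs[j] < 0 else None

def partial_pricing(reduced_costs, segments, start_segment):
    """Scan segments in rotation, stopping at the first segment that
    contains an eligible (negative reduced cost) column."""
    n = len(reduced_costs)
    size = (n + segments - 1) // segments
    for k in range(segments):
        seg = (start_segment + k) % segments
        cols = range(seg * size, min((seg + 1) * size, n))
        eligible = [j for j in cols if reduced_costs[j] < 0]
        if eligible:
            return min(eligible, key=lambda j: reduced_costs[j])
    return None  # no eligible column anywhere: optimal
```

With "PARTIAL PRICE 1" there is a single segment, so partial pricing degenerates to full pricing, which is what makes the comparison with our full-tableau code direct.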
The revised method makes extensive use of reinversions. There are three reasons for them: the first two involve maintaining numerical accuracy within the procedures [Gill et al, 1987], and the third is refactorization of matrices used in the revised method [Chvátal, 1983]. A full tableau method would only do a reinversion for the first two reasons. Reinversions for the first two reasons are executed infrequently whereas refactorizations are quite frequent. Our serial algorithm has a "refresh" procedure built in. This was usually disabled for the purposes of experimentation.
7. Computational Experiments
Tests were performed on synthetic problems of varying sizes, aspect ratios and densities using our serial method, our parallel method and MINOS. Section 7.1.1 discusses these synthetic problems. Tests were also performed on problems in the Netlib library (see Section 7.1.2). These tests were also used to validate the models and to compare our standard method with the revised method.
We wanted to be able to generate problems of specified sizes, aspect ratios and densities. At the same time we wanted to use more realistic problems. These generators take as parameters m = the number of rows, n = the number of columns, d = the density of the non-zero coefficients (0 < d ≤ 1), and seed = the seed for the random number generator; in addition the user specifies a file descriptor for the mps output file.
generator
All the constraints are of the less-than-or-equal type. Whether a constraint coefficient is non-zero is determined randomly with probability d. The value of a non-zero coefficient is chosen uniformly between 0 and 1. The right hand side coefficients are all 1. The objective coefficients are all -1, with the exception of those corresponding to columns that, by chance, end up with all 0 constraint coefficients; these are set so as to avoid unbounded solutions. Thus, excluding these zero columns, we seek to maximize the sum of the variables. The initial solution determined by setting all the variables to 0 is feasible, so no initial phase is needed. There is no guarantee (because of the random numbers) that the actual density of the problem is exactly or even close to d. The program does report the actual density. The output is an mps format file defining the
generated LP.
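A simplified sketch of this generator follows. It builds the matrix in memory rather than writing an MPS file, and the handling of all-zero columns (objective coefficient 0) is an assumption where the original's exact choice is not stated:

```python
import random

def generate(m, n, d, seed):
    """Sketch of generator: m '<=' rows, n columns; each entry is nonzero
    with probability d, drawn uniformly from (0, 1); all right hand sides
    are 1; objective coefficients are -1 (maximize the sum of variables),
    except that all-zero columns are dropped from the objective."""
    rng = random.Random(seed)
    A = [[rng.random() if rng.random() < d else 0.0 for _ in range(n)]
         for _ in range(m)]
    b = [1.0] * m
    c = [-1.0 if any(A[i][j] != 0.0 for i in range(m)) else 0.0
         for j in range(n)]
    # as in the text: the realized density is reported, not guaranteed
    actual_density = sum(v != 0.0 for row in A for v in row) / (m * n)
    return A, b, c, actual_density
```

Note that x = 0 is feasible by construction (all right hand sides are 1 and the constraints are of the ≤ type), so the solver needs no initial phase.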
A danger with synthetic problems is that there may be covert, underlying structure that makes the problem much more special than problems that might appear in practice. This was recognized early on by Kuhn and Quandt [1963]. They proved a theorem that applies to generator, which gives an asymptotic estimate of the value of the objective. Luby and Nisan [1993] give a fast parallel approximation algorithm for such positive linear programming problems; this also applies to the problems generated by generator. Thus there are obviously special features of this class of problems which make them easier. Fortunately, this is revealed more by the number of iterations than by the work per iteration. Since our methods apply to savings within iterations, these considerations do not undermine our comparisons.
generator1
The constraints are generated as in generator, and they are also all less than or
equal constraints. The objective coefficients are now generated randomly between -1
and -0.5. If the column has all zero coefficients in the constraints the sign is reversed.
The right hand side coefficients are also generated randomly, uniformly in the range
0.5 to 1. The Kuhn-Quandt Theorem no longer applies, but the Luby-Nisan Algorithm
does.
generator2
In this generator we again have less-than-or-equal constraints. The non-zero matrix elements are generated uniformly between -1 and 1. The objective coefficients are generated uniformly between -m and m. The constraints are constrained to range between -1 and 1. Now neither the Kuhn-Quandt Theorem nor the Luby-Nisan Algorithm applies to the results of this generator. Notice that this (and only this) generator requires the RANGE feature of the solver. The RANGE feature provides for upper and lower bounds on the constraints as well as the variables.
Figure 7.1 shows the total time as density increases for the three generators.
Figure 7.2 shows the time per iteration as density increases for the three generators.
Note that the total running times vary widely for the three types of generators while the times per iteration are very close. This supports our view that the type of synthetic problems used affects total running time more than the time per iteration.
Figure 7.1 – Total time (secs.) by generator vs. density

Figure 7.2 – Time per iteration by generator vs. density
The Netlib library provides a standard set of problems for testing linear programming code. The Netlib problems are in general sparse.
Section 5 provided running time projections for our serial and parallel
programs. We used our models to pick the optimal number of processors to use. We
then were able to compare the running times of both our parallel and serial algorithms
with the revised simplex algorithm and to characterize which types of problems our
methods are good for. This analysis shows the advantages of our dpLP parallel
program for all problem sizes. This was discussed in Section 5. The parallel
program’s expression had a computation part and a communication part. We also had
a separate computation expression for the steepest edge column choice rule. In this
section we validate those expressions by comparing the actual running times of a set of problems with the projected times. To do so we must first estimate the coefficients of the terms in our expressions. These coefficients would vary with the environment, so it is the expressions with these coefficients that can be verified in our environment. Two of the expressions are for computation, one for the standard column choice rule and one for the steepest edge rule. One is for communication. Similarly, for our serial full tableau program we have two expressions. The constant terms required for the computation expressions are a) column choice time per unit vector element (ucol), b) row choice time per unit vector element (urow), and c) pivot time per unit matrix element (upiv). These constants are defined
in Section 5.5. The constant terms required for the communication expressions are s and g.
There are, in general, two methods that we employed to get the coefficients.
One is by directly estimating those coefficients. The second method applies linear
regression to actual runs of the linear programming code to estimate the values of the
coefficients. If the regression produces a tight fit we can be confident that the
coefficients are accurate and that the expression correctly estimates the running times
of the programs.
In the direct method we time the step corresponding to that coefficient. We then divide the time by the variables that are multiplied by it in the expression. For example, in order to find upiv we time the pivot step and divide by the term that multiplies upiv in the expression.
The coefficients were estimated by running these problems together with problems from the Netlib library. In order to verify the parallel dpLP expressions the problems were run in parallel using multiple processors. The smallest number of processors used was 2 and the largest was 7.
Figure 7.3 plots time per iteration against mn for the standard column choice
rule. Figure 7.4 is a similar graph for the steepest edge rule and is explained later in
this section. In Figure 7.3 the coefficient upiv dominates, especially for large
problems. This is because the pivot step in fact takes about 95% of the computation
time. The other two coefficients can actually be left out of the expression. One can
see from the figure that as mn grows so does the running time. The points of the
actual running times almost completely lie on the projected value line. This verifies
that the running time is almost completely dependent on mn. Since mn is the pivot
term of the expression, it justifies our leaving out the other terms. In fact, regression
was only used to find out the value of upiv. The other coefficients were not accurately
estimated when using regression, probably due to the fact that upiv dominates the
other coefficients.
Below are the values obtained using the direct timing of the 3 steps of an iteration. These values come from Tables 5.1 and 5.2 in Section 5.4. We also include upiv as estimated using regression, even though its value was not used in the formulas. ucol_se, the unit cost for the steepest edge column choice rule, is listed as well.

ucol      3.73E-08
urow      1.65E-06
upiv      1.24E-07
ucol_se   3.47E-07
Only the pivot coefficient and term of the expression are used to estimate the timing; the other terms are negligible. The estimation was applied to a number of problems, with the relative percentage error computed as

100 * (observation - estimate) / estimate.
The mean percentage relative error observed amongst these problems was
5.00%. It is important to note that most large problems had a relative percentage error
of less than one percent. Unfortunately a few of the smaller problems gave larger
errors, which pulled the average up. The maximum relative error was 19.25%. As we
explained, the pivot step takes the main bulk of the time, and the time taken by the other steps is relatively insignificant. For small problems that assumption is less accurate, which explains the larger relative errors.

Figure 7.3 – Iteration time vs. mn (standard column choice rule)
Figure 7.4 – Iteration time vs. mn (steepest edge column choice rule). Shown are the actual iteration times, the projection when 100% of SE columns are looked at, and the projection when 0% are looked at.
The serial time expression is essentially the same as our parallel expression.
The only difference is that it uses only one processor. We used the same coefficients
obtained for the parallel program’s expression for the serial expression. We executed
the serial program for the same group of problems described above. We then took the
average error between the estimation and its actual running time. Our serial program
gave 15.34% and 7.73% for the maximum and average relative percentage errors
respectively. Again the small problems pulled up the mean relative percentage error.
The expression for the steepest edge rule is different from the computation expression for the standard rule because of the additional work done per eligible column (see Section 3.3). This could roughly double the number of significant
operations compared to using the standard column choice rule. Based on Table 5.2
this will actually cost, on our network, between three and four times the total
computation time per iteration compared to using the standard column choice rule.
This assumes that all the columns are eligible. In practice, however, many columns
are not eligible. In one empirical test we found that only 35% of the columns were
eligible; the other columns could be immediately eliminated. The cost of an iteration
is therefore upper bounded by twice the number of operations and between three and
four times the time cost of an iteration (on our network) when the standard column
choice rule is used. This upper bound is rarely reached. For the steepest edge column
choice rule, therefore, the ucol_se coefficient is also significant. Note that this coefficient is different from the ucol coefficient of the standard column choice rule discussion. ucol_se here represents the unit cost of doing the steepest edge column choice rule assuming all columns are looked at. The value of ucol_se was listed above.
In order to accurately estimate the running time of the program when using
steepest edge we must know the percentage of columns actually looked at during the
program. This percentage would then be multiplied by ucol_se. It is not known before a problem is solved, and we therefore cannot give a single projected line as in Figure 7.3. If, however, we use the generic 35% number mentioned above, we do get a reasonably good estimate of the running time. Assuming we know the number of columns actually looked at during execution, 17.72% and 8.22% are the maximum
A graph similar to the one shown for the standard column choice rule is
provided in Figure 7.4. The horizontal axis, as in Figure 7.3, is the problem size mn.
Figure 7.4 shows two lines. The top line corresponds to problems where the program
looks at every column within the steepest edge column choice rule. The bottom line
corresponds to a problem where the program looks at none of the columns within the
steepest edge column choice rule. In practice a percentage of the columns are looked
at as we just explained. Note that the actual running times per iteration fall in between
these projected lines. This verifies our steepest edge rule timing projection.
Communication time
The timing for communication is the wall clock time. It is important to run
communication tests during network idle time to avoid time accruing from other
processes running.
Another issue is to make sure that the communication times are accurate for
more than two communicating processors. To this end we estimated s and g in the
context of broadcast and all-reduce. This verified the accuracy of s and g even when more than two processors communicate.
Communication time can be divided into two parts. First, before the first
message can be read, the reading processor might be waiting for the sender to finish
its computation. This is referred to as wait time. The second part is the actual communication time. To separate the two, we placed a barrier command before the timing of the communication within the program. The only need for an explicit barrier command is for this particular timing test. This command separates the effect of processors waiting for other processors from the communication itself. The barrier command itself, however, adds some time to the wait time. We timed this by putting a number of barrier commands inside a loop.
A barrier built directly on the socket interface was then substituted. This surprisingly decreased not only the wait time but also its variability. The MPI barrier adds overhead because after a processor enters the barrier, it lets the processors leave at slightly different times. The new socket barrier method seems to take away most of the overhead the MPI barrier had. For the problems tested, the wait time plus the communication time were almost the same as the communication time that was measured without the barrier.
Table 7.2 compares percentage errors in two groups of problems. The first is a
group of 24 problems each of which was executed using 2 processors. The wait time
was separated from the communication time by use of socket calls. The second group
used the same problems as the first group. This group, though, was executed once
using 2 processors and then with 3 processors… all the way up to 7 processors. This
gives a total of 144 runs. The table contains both the maximum and average relative
percentage error for these two problem groups. The rows correspond to s and g values
derived from different sources. The first row shows the s and g that result from direct
experimentation. The bottom two rows show the s and g that result from regression.
Table 7.3 lists the 24 problems that were used with their sizes and densities. The first
10 problems, with names beginning with “d” are synthetic problems generated by
generator (the first one). Note, however, that for this experiment the densities vary from problem to problem.

Table 7.3 – name, m, n and d for the 24 test problems
These tests were repeated on the large problems. (Four of the 24 problems
were excluded.) In this set of 20 problems one had 50,000 matrix elements, and the
other 19 had at least 100,000 elements. The results were virtually the same (within .5 percent).
Wait time
Wait time is the time that processors spend at a synchronization point waiting
for other processors to finish computation. In general this time should be short if the
load is evenly distributed amongst the processors. This waiting time is actually a
function of the computation parameters m, n, and p. The longer the computation, the more two different processors might vary in their computation time. For the classical column choice rule, the large cost of pivoting is what causes most of the wait time; the classical column choice itself is insignificant in terms of time. In the steepest edge rule, both the cost of pivoting and the cost of column choice contribute heavily. Only one processor does the row choice, which is why it does not contribute to wait time but is instead accounted for separately.
We timed many pivots on a constant size dense matrix. We found very small
random discrepancies in time between the pivots. Each pivot step does the same
number of calculations. Since the discrepancies were very small and random, we
assume it comes from something random within the computer. For each iteration the
processors must wait for the slowest pivot. This wait time of one iteration should be
equal to the maximum pivot time of the processors minus the minimum pivot time of
the processors. Sum this per-iteration wait time over all iterations. This sum should be the total wait time.
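This accounting amounts to the following sketch; the per-processor pivot times below are illustrative values, not measurements:

```python
def total_wait_time(pivot_times):
    """pivot_times[i][r] = pivot time of processor r in iteration i.
    Every processor must wait for the slowest pivot each iteration, so
    the wait attributable to iteration i is (max - min) over processors;
    the total is the sum over all iterations."""
    return sum(max(it) - min(it) for it in pivot_times)

# e.g. two iterations on three processors (hypothetical times, seconds)
times = [[0.100, 0.102, 0.101],
         [0.099, 0.100, 0.104]]
```

Because the discrepancies between pivots were small and random, this sum stays small relative to the total computation time when the load is evenly distributed.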
In our small problems, the computation time is much larger than the
communication time. As a result, the wait time is greater than the communication
time. This should change as the optimal number of processors is reached. At that point the communication time will be much larger than the wait time.
The revised method has several variants. They all go through the same basic
steps that use the inverse of the basis or some functionally equivalent representation.
For a more detailed discussion see [Nash, 1996] and [Chvátal, 1983]. The basic steps
are as follows:
Steps A and C make use of the “basis inverse” while step E keeps the “basis
inverse” current. For the case of the explicit inverse, step E is executed every iteration and
has a cost of about m², where m is the rank of the basis.
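For intuition, the explicit-inverse form of step E can be sketched as a product-form pivot; this is an illustrative O(m²) update under our assumptions, not the author's implementation:

```c
#include <stddef.h>

/* Product-form update of an explicit basis inverse after a pivot.
 * binv is the m x m inverse (row-major); d is the entering column
 * premultiplied by the current inverse; r is the pivot (leaving) row.
 * Cost is about m*m operations, as noted in the text. */
void update_inverse(double *binv, const double *d, size_t r, size_t m)
{
    for (size_t k = 0; k < m; k++)
        binv[r * m + k] /= d[r];           /* scale the pivot row */
    for (size_t i = 0; i < m; i++) {
        if (i == r) continue;
        for (size_t k = 0; k < m; k++)     /* eliminate in other rows */
            binv[i * m + k] -= d[i] * binv[r * m + k];
    }
}
```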
performed even in the explicit inverse for the sake of numerical accuracy. Refresh is
A very big factor in the running time of the revised method is sparsity. There
are two types of sparsity. The first is the sparsity of the original data. The second is
the sparsity of the inverse or its equivalent. Fill-in is the term used when the “inverse”
Steps A and C can always take advantage of the first type of sparsity. The
explicit inverse representation of the revised method can only be expected to take advantage
of the first type of sparsity. This is because there is only one inverse and in general
the inverse of a matrix will be dense even for a sparse matrix. On the other hand,
there are many possible factorizations of a matrix. This allows a “smart” factorizing
construction to choose one with very little fill-in. This is implemented by heuristically
choosing pivot elements that result in a sparse factorization. This allows the inverse representation to stay sparse. Steps
A, C and E, in these schemes, can take advantage of the second type of sparsity. Eta
factorization and triangular factorization are two ways of factorizing the inverse.
MINOS uses triangular factorization. It adds Eta vectors for each pivot until the next
refactorization.
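The eta mechanism can be sketched as follows; this is an illustrative dense FTRAN under simplifying assumptions (each eta stored dense), not MINOS's sparse code, and the function names are hypothetical:

```c
#include <stddef.h>

/* One eta matrix differs from the identity only in column r, which
 * holds eta[0..m-1].  Solve E x = b in place (x holds b on entry). */
void eta_solve(double *x, const double *eta, size_t r, size_t m)
{
    x[r] /= eta[r];
    for (size_t i = 0; i < m; i++)
        if (i != r)
            x[i] -= eta[i] * x[r];
}

/* FTRAN with an eta file: if B = E1 E2 ... Ek, solving B x = b means
 * applying each eta's inverse in turn.  etas holds k dense eta
 * vectors back to back; rows[e] is the pivot row of eta e. */
void ftran(double *x, const double *etas, const size_t *rows,
           size_t k, size_t m)
{
    for (size_t e = 0; e < k; e++)
        eta_solve(x, etas + e * m, rows[e], m);
}
```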
The MINOS User’s guide says [Murtagh and Saunders, 1998]: “MINOS
al [1987]. For a description of the concepts involved see Reid [1976, 1982]. The basis
“LU density tolerance.” It changes the refactorization based on density. MINOS 5.5
It is therefore hard to come up with a performance model for MINOS that takes all the heuristics into
account.
We can make an expression for the revised method that would take the first
type of sparsity into account. In Section 5 the graphs and discussion assumed an
expression that uses the explicit inverse form of the revised method. This is not the
way MINOS implements the revised method, but it is close to the upper bound when the
second kind of sparsity is assumed not to occur. The sparsity in the functional
equivalent of the basis inverse is unknown before solving the problem. The revised
expression can theoretically be verified by going into the MINOS source code and
calculating the fill-in that occurs every iteration, similar to what we did for the
steepest edge expression in our full tableau program. Since MINOS is not our code, we
did not do this.
Our method becomes more competitive as density rises and as the number of processors rises. This is shown in the next section.
Figure 7.5 corresponds to Figure 5.3 and Figure 7.6 corresponds to Figure 5.6
and 5.7. Note that Figure 5.7 uses more processors than we have. Tables 7.4 and 7.5
We ran a problem of size 1,000 by 5,000. For Figure 7.5 and Table 7.4 we
used 5% density. We stopped the program after 500 iterations. For the few runs that did
not reach that many iterations, we scaled the measured time by 500/(number of pivots)
to obtain the time it would take for 500 iterations. Only 3 problems needed this.
Figure 7.5 and Table 7.4 compare MINOS and retroLP over varying densities.
We can see that for this problem, somewhere between 70% and 80% density, retroLP overtakes MINOS.
Figure 7.6 and Table 7.5 compare MINOS and dpLP on a problem with 5%
density. The number of processors is increased up to 7. The optimum value is in fact
about 53 processors. MINOS takes 24.24 seconds whereas dpLP when run on 7
prediction. The same model predicts a running time of 11.73 seconds if dpLP would
be run over the optimal number of processors. From the graph, we can also see the
steady speedup as the number of processors increases; it had not leveled off at 7
processors.
[Figure 7.5: time per 500 iterations vs. density; curves: Revised-MINOS, Serial-retroLP]

[Figure 7.6: time per 500 iterations vs. number of processors (1-7)]
As noted in Sections 4.1.2 and 5.11, one of the advantages of using a full
tableau parallel algorithm is the ability to take advantage of more complicated column
choice rules. Figure 7.7, “retroLP vs. MINOS”, shows total running time as a function
of density for problems with m=500 and n=1,000. It shows retroLP with both the
Dantzig and steepest edge column choice rules. It also shows MINOS (the revised
method). A second plot shows MINOS time divided by retroLP time as a function of density for the same data. The
density at which the Dantzig column choice rule overtakes MINOS is around 70%.
The density at which the steepest edge column choice rule overtakes MINOS is
between 2% and 5%. The points in both of these figures represent nine runs each, one
run for each of the three generators combined with three different seeds.
Figure 7.8 shows MINOS time divided by retroLP time as a function of density for
problems with m=1,000 and n=5,000. These runs executed a tableau reinversion once
every 5,000 iterations. This reinversion cost is very close to 20% extra time for the
Dantzig column choice rule and about 15% extra time for the steepest edge column
choice rule. This is why the Dantzig version doesn’t catch up with the revised in this
figure.
It should be noted that although partial pricing doesn’t help in retroLP for the
classical column choice rule, it would make a big difference in the steepest edge rule.
Section 5.3. The UNIX environment was used for all the other timing.
[Figure 7.7: retroLP vs. MINOS; time (secs.) vs. density]

[Figure: time (secs.) vs. density]

[Figure 7.8: time (secs.) vs. density]
8.1 Summary
In conclusion, our method has made large linear programs more tractable. It is
especially good for large dense problems, or when the optimal number
of processors is available (even for problems that are not dense). By taking advantage of
parallelism it can also divide the extra load of alternate column choice rules, which
We have
the full tableau method. This implementation runs both on UNIX machines
and on PC’s
4) Determined at what density our method becomes more efficient than existing revised-method implementations
5) Analyzed the number of processors needed to make our parallel method more
efficient than the revised method for given problems
6) Analyzed when the other column choice rules, in particular the steepest edge
column choice rule, can help our parallel method achieve faster total running
time; break-even with the revised method is achieved at lower densities and with fewer processors than when using the
There are a number of applications that lead to dense linear programs. One is
data mining [Bradley, 1999] and text categorization using the method of [Bennett and
Mangasarian, 1992], [Bosch and Smith, 1998]. The idea is, given a collection of
different articles and a group of categories, to put each article into its proper category.
We can take a document and for a given category decide whether or not the document
is a member of that category. First, n keywords are chosen to help distinguish between
categories. The variables in the LP correspond to these words. For each article each
keyword is counted to get its frequency in that article. The vector of these frequencies
defines a point in n space, which corresponds to a row in the LP. For most groups of
words the resulting tableau will be sparse. If instead of using individual words we
aggregate groups of words, the problem will become smaller and denser. The
grouping is known as feature compression and extraction. One solves the resulting
dense linear program to find a hyperplane that separates the points in the category
from the rest. Wavelet methods [Mallat, 1999 pg. 419] give rise to other dense applications, as do LP
relaxations of machine scheduling problems, in which we schedule a number of tasks in such a way as to minimize the total time elapsed. The
rows correspond to the points in time. The variables (columns) correspond to the
tasks.
Other applications that lead to dense LPs include digital filter design, data analysis and classification, and financial planning.
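The row construction described above (counting each keyword's frequency in an article) might be sketched as follows; the helper name and the flat word list are hypothetical:

```c
#include <string.h>
#include <stddef.h>

/* Fill one LP row: row[j] = number of occurrences of keywords[j] in
 * the article's word list words[0..nwords-1].  The vector of these
 * frequencies is the point in n-space described in the text. */
void frequency_row(double *row, const char **keywords, size_t n,
                   const char **words, size_t nwords)
{
    for (size_t j = 0; j < n; j++) {
        row[j] = 0.0;
        for (size_t w = 0; w < nwords; w++)
            if (strcmp(words[w], keywords[j]) == 0)
                row[j] += 1.0;
    }
}
```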
a. To analyze whether using dpLP with other column choice rules such
structures.
following form:
113
Maximize    c1x1 + c2x2 + c3x3 + c4x4 + ... + c(n-1)x(n-1) + cn xn
Subject to  a11x1 + a12x2                     op b1
            a21x1 + a22x2                     op b2
                    a33x3 + a34x4             op b3
                    a43x3 + a44x4             op b4
                            ...
            a(m-1,n-1)x(n-1) + a(m-1,n)xn     op b(m-1)
            a(m,n-1)x(n-1) + a(m,n)xn         op bm
where op refers to any of the relations =, <= or >= and the
variables can be bounded from above and below.
finish their pivot and column choice at the same time. This might
enhancements for the case of using networks other than a simple Ethernet.
E. Dynamic load balancing and fault tolerance. Figuring out how to deal with
F. Use of partial pricing for the steepest edge rule in the full tableau method.
Maximize    z = cx
Subject to  Ax op b
            lj ≤ xj ≤ uj,   j = 1, ..., n

where op refers to any of the relations =, <= or >= .
A is the constraint matrix. l and u are the lower and upper bounds respectively.
lj and uj can be negative or positive infinity. If both bounds of a particular variable are
infinite, the variable is said to be “free.” If both bounds of a particular variable are the
same (lj=uj), the variable is said to be “fixed.” If lj is not equal to uj and both are finite, the variable is bounded. In the C
language we often denote vectors by "[ ]" and matrices by "[ ][ ]."
retroLP and dpLP both use the simplex method with bounded variables. They
(a[ ][ ], nl[ ], nu[ ], m and n are actually passed via the structure given in Table A.1.)
The output of the simplex is in the a[ ][ ] matrix at the end. Another function
Much of the data is stored in the matrix a[ ][ ] which has m+2 rows and n+1
columns where n is the number of variables and m is the number of constraints. The
0th column holds the b vector and the 0th row holds the c (objective) vector. The
(m+1)th row stores the Phase 1 objective. Constraints in the matrix can be a mixture
of less than, greater than and equalities. Vectors nl[ ], bl[ ], nu[ ] and bu[ ] hold the
upper and lower bounds of the variables. nrange[ ] and brange[ ] are lists of flags
indicating whether a variable is currently between its bounds, below its lower
bound, at its upper bound, or at both bounds (for fixed variables only). The values of
nrange and brange are determined by the program and need not be input.
These data structures describing the linear programming problem are all
typedef struct
{
char *name; // name of problem (usually file name (100))
long m; // number of rows
long n; // number of columns
// (actually, the matrix is (m+2)x(n+1))
long mm; // index of the current objective row; m+1
// for Phase I; 0 for Phase 2.
double ** a; // points to the constraint matrix ((m+2)x(n+1))
// (n+1)
var_type *ntype; // types of non-basic variables: fixed,
// lower bounded, upper bounded, both, free. (n+1)
double *nl; // lower bounds of non-basics n
double *nu; // upper bounds of non-basics n
long *jnonbasic; // indices of current non-basic variables
var_range *nrange; // non-basic values within, at, below,
// or above bounds.
double *x; // current value of non-basic variables (to
// implement EXPAND) (n+1)
var_type *btype; // types of basic variables: fixed, lower
// bounded, upper bounded, both, free. (n+1)
double *bl; // lower bounds for basic variables
double *bu; // upper bounds for basic variables
long *ibasic; // indices of current basic variables (m+1)
var_range *brange; // basic values within, at, below,
// or above bounds.
double *b; // current value of basic variables
} LP_state;
Table A.1 - Data structure for retroLP and dpLP
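One plausible way to allocate the (m+2) x (n+1) matrix pointed to by LP_state's a member is sketched below; this is illustrative, not retroLP's actual allocator:

```c
#include <stdlib.h>

/* Allocate the (m+2) x (n+1) tableau used as LP_state.a:
 * row 0 holds c, row m+1 the Phase 1 objective, column 0 holds b. */
double **alloc_tableau(long m, long n)
{
    double **a = malloc((m + 2) * sizeof *a);
    if (!a) return NULL;
    for (long i = 0; i < m + 2; i++) {
        a[i] = calloc(n + 1, sizeof **a);  /* zero-initialized row */
        if (!a[i]) {
            while (i--) free(a[i]);
            free(a);
            return NULL;
        }
    }
    return a;
}

void free_tableau(double **a, long m)
{
    for (long i = 0; i < m + 2; i++) free(a[i]);
    free(a);
}
```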
Input routines read the linear programming problem and fill the data structures just mentioned.
Maximize    z = 2x + 2y - 5z
Subject to  5x - 4y + 3z ≤ 4
            5x + 3y + 3z ≥ 2
            2x + 3y - 4z = 10
            2 ≤ y ≤ 10,  x, z ≥ 0

First add a slack, surplus and artificial variable to the constraints (this can be done
implicitly).

Maximize    z = 2x + 2y - 5z
Subject to  5x - 4y + 3z + s1 = 4
            5x + 3y + 3z + s2 = 2
            2x + 3y - 4z + s3 = 10
            2 ≤ y ≤ 10,  x, z ≥ 0
            s1 ≤ 0,  s2 ≥ 0,  s3 = 0

Solving each constraint for its new variable gives:

Maximize    z = 2x + 2y - 5z
Subject to  s1 = 4 - 5x + 4y - 3z
            s2 = 2 - 5x - 3y - 3z
            s3 = 10 - 2x - 3y + 4z
            2 ≤ y ≤ 10,  x, z ≥ 0
            s1 ≤ 0,  s2 ≥ 0,  s3 = 0
A[ ][ ] =    0    2    2   -5
             4   -5    4   -3
             2   -5   -3   -3
            10   -2   -3    4
             0    0    0    0
nl[ ] = 0 2 0
bl[ ] = 0 0 0
nu[ ] = INF 10 INF
bu[ ] = INF INF INF
nrange[ ] = L L L, where U means the variable is at its upper bound, L at its lower
bound, BOTH means it is a fixed variable, and FREE means it is a free variable.
There are 3 basic variables (a slack, surplus and artificial variable) corresponding to the three constraints.
The top (0th) row is the objective; the bottom (m+1)th row is the place for a Phase 1
objective. The first (0th) column holds the right hand side constants. The resulting
MPS is a standard input format for linear and integer programs. More details on the MPS format can be found in
Murtagh [1998]. It is the format currently supported by our programs. MPS has a
fixed and a free format. The fixed format is the one used by MINOS and our code;
it is the one we will describe. Each row in the file has fields, which lie in the specific columns shown in
Figure A.1
Keywords delimit the different sections of the file. They all begin in column 1
of their respective rows. The row starting in column 1 with “NAME” can have an
8-character problem name in Field 2. Every row in the ROWS section has a one-character keyword in Field 1 that specifies the type of constraint that row will contain. There
are four possible row types: an objective (N), an equality (E), a less than or equal (L),
and a greater than or equal (G) constraint; Field 2 holds the row name.
Rows in the COLUMNS section contain row name-value combinations: a value to put into the row
and column given in Fields 3 and 2 respectively. Fields 4 and 5 contain another
row name-value pair; they can be left blank.
Every row in the RHS section has a right hand side (rhs) name in Field 2, and
then a value to put into the row and rhs given in Fields 3 and 2 respectively.
(The MPS format supports multiple right hand sides; our implementation allows
only one.) Fields 4 and 5 may contain another row name-value pair or may
be left blank.
Every row in the BOUNDS section has a two-character keyword (UP, LO,
FX, FR) in Field 1 followed by a bound name in Field 2. (The MPS format
supports multiple bounds; our implementation allows only one.) The keyword in
Field 1 specifies what type of bound the variable (column name) specified in
Field 3 will be. There are four possible bound types: an upper bounded
variable (UP), a lower bounded variable (LO), a fixed variable (FX), or a
free variable (FR). Field 4 gives the value for the bound of the variable named in
Field 3 (for a given problem solution there will be only one bound).
NAME TESTPROB
ROWS
N COST
E EQ1
E EQ2
COLUMNS
XONE EQ1 1
XTWO EQ2 1
XTHR COST -10 EQ1 -1
XTHR EQ2 -1
XFOUR COST 100 EQ1 1
XFOUR EQ2 1
RHS
RHS1 EQ1 2 EQ2 3
BOUNDS
UP BND1 XTHR 1
UP BND1 XFOUR 1
LO BND1 XFOUR -10
ENDATA
There are 3 rows; the first is row "COST" which is an objective row denoted by
keyword N. The second and third are rows called "EQ1" and "EQ2" which are
equality rows denoted by keyword E. Another two keywords not in this file are G and
L for greater than and less than constraints. There are four columns with names
"XONE", "XTWO", "XTHR" and "XFOUR". On the right of the column name are
one or two row names with values indicating all the values for that column. Next the
right hand side (b vector) is given in the same way the columns were given. Finally
there are three bounds. Two upper bounds denoted by keyword UP and one lower
bound denoted by LO. There are another two types of variable bounds; FX for fixed
variable and FR for free variable. There is also another section called RANGES.
(BND1 is just the "name" given to the bound in case there is another set of bounds.
RHS1 is the same. Usually there is only one RHS and one set of bounds.) Our
implementation only looks at the first set of bounds or RHS’s if there are more than
one. All values not mentioned are assumed to be 0. The problem can be a max or min
although it is usually assumed to be min. If bounds are not given for a variable, the default is a lower bound of 0 and an upper bound of infinity.
Further information on the MPS format can be found in [Murtagh, 1998] and at
ftp://softlib.cs.rice.edu/pub/miplib .
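A fixed-format field can be extracted by column position. The positions below (Field 1: columns 2-3, Field 2: 5-12, Field 3: 15-22, Field 4: 25-36, Field 5: 40-47) follow the common fixed MPS convention and should be checked against Murtagh [1998]; the helper itself is an illustrative sketch:

```c
#include <string.h>
#include <ctype.h>
#include <stddef.h>

/* Copy one fixed-format MPS field (1-based, inclusive column range)
 * out of a line, dropping blanks.  Returns out. */
char *mps_field(const char *line, int first, int last, char *out)
{
    size_t len = strlen(line);
    int i = first - 1, j = 0;
    while (i < last && (size_t)i < len) {
        if (!isspace((unsigned char)line[i]))
            out[j++] = line[i];
        i++;
    }
    out[j] = '\0';
    return out;
}
```

For example, on the BOUNDS line " UP BND1      XFOUR" from the file above, columns 2-3 yield the keyword and columns 5-12 the bound name.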
retroLP is written in C++-compatible C. It takes input in the MPS format and supports all the options for
linear programming implied by the format except that multiple runs are not yet
supported. That is, our implementation expects at most one set of right hand sides and one set of bounds.
Three column choice rules are supported: The classical rule of Dantzig, the
steepest edge rule, and the maximum change rule. The algorithm can be easily
extended to allow the same problems to use differing column choice rules in different
iterations.
retroLP supports periodic tableau reinversion. The same procedure can be used to support basis crashing and warm
restarts. We use the specification for MPS given in Murtagh and Saunders [1998].
retroLP is effective for dense linear programs with moderate aspect ratio.
Such problems arise, for example, in digital filter design, image processing, curve
fitting, and pattern recognition. The program can start from any assignment of values
to the variables, within bounds or not. In particular, retroLP can be used in a hybrid
computation with an interior point method along the lines suggested by Bixby et al
[2000].
Within column() there are many different possible column choice rules, only one of
which is usually used for a given run, although mixing them is possible.
retroLP first preprocesses data that comes in the MPS format. This was
described in Appendix A. The main simplex routine fills up row m+1 with the Phase 1 objective.
Phase 1
loop on Phase 1.
Phase 2
1) Do column selection.
over.
loop on Phase 2.
dpLP first preprocesses data that comes in the MPS format. This was described
in Appendix A. dpLP then divides the n columns into p groups. Each of the p
processors gets approximately n/p of the columns. Each processor stores its data in
The main simplex routine fills up row m+1 with the Phase 1 objective for all
Phase 1
1) Each processor does column selection on its columns; the global max is
2) The winning processor does row selection. It selects the row whose constraint
3) The pivot column of the processor with the global max (winning processor) is
broadcast to all the processors together with the pivot row. Do a pivot on
element (ip,kp). The processors do this alone using the identical copy of the
winning column.
loop on Phase 1.
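The global-maximum determination in step 1 is a reduction across processors; a serial sketch of the argmax logic is below (in dpLP's MPI setting this corresponds to an MPI_Allreduce with the MPI_MAXLOC operation), with hypothetical names:

```c
#include <stddef.h>

/* Each of the p processors proposes its best local reduced cost
 * (best[j]) together with the global column index it belongs to
 * (cols[j]).  Return the rank of the winning processor and store
 * the winning column in *col.  This mirrors what
 * MPI_Allreduce(..., MPI_MAXLOC, ...) computes for dpLP. */
size_t global_winner(const double *best, const long *cols, size_t p,
                     long *col)
{
    size_t w = 0;
    for (size_t j = 1; j < p; j++)
        if (best[j] > best[w])
            w = j;
    *col = cols[w];
    return w;
}
```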
In Phase 2 each processor will use its row 0 for the objective.
Phase 2
1) Each processor does column selection on its columns; the global max is
over.
2) The winning processor does row selection. It selects the row whose
3) The pivot column of the processor with the global max (winning
processor) is broadcast to all the processors together with the pivot row.
loop on Phase 2.
References
Bennett, Kristin and Olvi Mangasarian, “Robust Linear Programming Discrimination of
Two Linearly Inseparable Sets,” Optimization Methods and Software vol. 1 1992 pgs. 23-34.
Bixby, Robert E. and Alexander Martin, "Parallelizing the Dual Simplex Method,"
Bosch, Robert and Jason Smith, “Separating Hyperplanes and the Authorship of the
Bruck, Jehoshua, Danny Dolev, Ching Ho, Rimon Orni and Ray Strong, “PCODE:
Comer, Douglas and David Stevens, Internetworking With TCP/IP Volume III:
Culler, David and Richard Karp et al, “LogP: Towards a Realistic Model of Parallel
Dongarra, Jack and Francis Sullivan, "Guest Editors' Introduction: The Top 10
Algorithms," Computing in Science and Engineering vol. 2 no. 1 2000 pgs. 22-23.
ORSA Journal on Computing (INFORMS) vol 7 no. 4 Fall 1995 pgs. 402-416.
Eiselt, H.A. and C.L. Sandblom, "External pivoting in the simplex algorithm,"
Eiselt, H.A. and C.L. Sandblom, "Experiments with External Pivoting." Computers
Forrest, John and Donald Goldfarb, “Steepest-edge simplex algorithms for linear
Geist, G.A., J.A. Kohl and P.M. Papadopoulos, "PVM and MPI: a Comparison of
general sparse matrix,” Linear Algebra and its Applications vol. 88-89 1987 pgs. 239-
270.
Gill, P.E., W. Murray, M.A. Saunders, M.H. Wright, “A practical anti-cycling procedure
pgs. 437-474.
Goudreau, Mark, Kevin Lang, Satish Rao, Torsten Suel and Thanasis Tsantilas,
“Portable and Efficient Parallel Computing Using the BSP Model.” IEEE
Hall, J.A.J. and K.I.M. McKinnon, “Update procedures for the parallel revised
Memory Machines,” Handbook for Theoretical Computer Science (J. van Leeuwen
Karp, Richard and A. Sahay, E.Santos and K. Schauser, "Optimal Broadcast and
Kuhn, Harold and Richard Quandt, “An experimental study of the Simplex Method,“
Luby, Michael and Noam Nisan, “A Parallel Approximation Algorithm for Positive
Mallat, Stéphane, A Wavelet Tour of Signal Processing. 2nd ed. Academic Press 1999 pg.
419.
Murtagh, Bruce and M. Saunders, “MINOS 5.5 User's Guide,” Technical report SOL
Nash, Stephen and Ariella Sofer, Linear and Nonlinear programming. McGraw-Hill
1996.
University 1995.
Reid J.K., “Fortran subroutines for handling sparse linear programming bases,”
Snir, Marc and Steve Otto et al, MPI: The complete Reference. MIT Press 1996.
(0-89791-278-0 )
advances in mathematical programming, Graves and Wolfe eds., McGraw Hill New
York 1963.