
A Distributed Implementation of the Simplex Method

DISSERTATION

Submitted in Partial Fulfillment of the Requirements for the degree of

DOCTOR OF PHILOSOPHY

(Computer & Information Science)

at the

POLYTECHNIC UNIVERSITY

by

Gavriel Yarmish

March 2001

Approved:

Department Head

Date

Copy No.__________

Copyright 2001 by

Gavriel Yarmish

All rights reserved.



Major: Computer Science


____________________
Richard Van Slyke

Professor of
Computer & Information
Science

___________
Date

____________________
Alex Delis

Associate Professor of
Computer & Information
Science

___________
Date

____________________
Torsten Suel

Assistant Professor of
Computer & Information
Science

___________
Date

Minor: Financial Engineering


____________________
Frederick Novomestky

Industry Associate
Professor of
Management

___________
Date

Microfilm or other copies of this dissertation are obtainable from:

UMI Dissertations Publishing

Bell & Howell Information and Learning

300 North Zeeb Road

P.O. Box 1346

Ann Arbor, Michigan 48106-1346



Gavriel Yarmish was born in the United States. He was awarded his BS in

Computer Science, Magna Cum Laude, from Touro College in 1991 and his MA in

Computer Science from Brooklyn College in 1993. He is a member of Tau Beta Pi,

the engineering honor society, and has taught Computer Science and Mathematics

since 1995. The research presented in this thesis was completed between 1994 and

2001.

This Dissertation is dedicated to my family, all of whom have lent

encouragement and support during the time spent on this research and before.

Acknowledgments

I wish to express my sincerest thanks to the chairman of my dissertation

committee, Dr. Richard Van Slyke, for working with me throughout this long

enterprise. I also thank the staff of the Computer Science Department at Polytechnic

University. Jeff Damens, our system administrator, helped install and troubleshoot

MINOS and MPI. He also responded quickly to all network-related problems.

Torsten Suel provided input regarding various communication models. I also wish to

thank Joel Wein for the use of his lab during the early stages of my research and

Boris Aronov for his help and concern throughout the progress of this study. R. N.

Uma helped to explain the setup of MPI in the computing labs. I wish also to

acknowledge my good friend Jacob Maltz for his help with UNIX shell scripting and

for other troubleshooting throughout the empirical studies. I wish also to

acknowledge my committee members, Alex Delis, Frederick Novomestky, and

Torsten Suel for taking the time to critique the dissertation.



Abstract

A Distributed Implementation of the Simplex Method

by

Gavriel Yarmish

Advisor: Richard Van Slyke

Submitted in Partial Fulfillment of the Requirements for the degree of

DOCTOR OF PHILOSOPHY

(Computer & Information Science)

March 2001

The Simplex Method, the most popular method for solving Linear Programs

(LPs), has two major variants. They are the revised method and the standard, or full

tableau method. Today, virtually all serious implementations are of the revised

method because it is more efficient for sparse LPs, which are the most common.

However, the full tableau method has advantages as well. First, the full

tableau can be very effective for dense problems. Second, a full tableau method can

easily and effectively be extended to a coarse grained distributed algorithm. While

dense problems are uncommon in general, they occur frequently in some important

applications such as digital filter design, text categorization, image processing and

relaxations of scheduling problems.

We implement two full tableau algorithms. The first, a serial implementation,

is effective for small to moderately sized dense problems. The second, a simple

extension of the first, is a distributed algorithm, which is effective for large problems

of all densities.

We developed performance models that predict running times per iteration for

the serial version of our method, the parallel version of our method and the revised

method for problems of different sizes, aspect ratios and densities. We also developed

methods for choosing the number of processors to optimize the tradeoff between

computation and communication in distributed computations. We tested our

algorithms on practical (Netlib) and synthetic problems.



Table of Contents
1. Introduction ------------------------------------------------------------------------------------ 1
2. Related Work ---------------------------------------------------------------------------------- 4
2.1 Interior point methods 4
2.2 Parallel implementations for the simplex method 4
3. Review of the Simplex Method ------------------------------------------------------------- 6
3.1 A short review of linear algebra 6
3.2 Definition of a Linear Program 7
3.3 Short description of the full tableau simplex method 10
3.4 Short description of the revised method 13
3.5 Running time comparison of the revised method and the full tableau
method 16
4. Motivation for a serial and distributed full tableau implementation -------------21
4.1 Why the method is applied to full tableau 21
4.1.1 No pricing out or column reconstruction 21
4.1.2 Alternate column choice rules can easily be used 21
4.1.3 The inverse does not have to be kept in each processor 23
4.1.4 Dense problems do not gain by use of the revised 23
4.2 Distributed computing 23
4.2.1 Why use distributed over MPP – no need for a supercomputer 23
4.2.2 Why coarse grain parallelism is used 24
5. Models and Analysis-------------------------------------------------------------------------25
5.1 Synchronized parallel pivots 25
5.2 Short sketch of one distributed pivot step 27
5.2.1 Choice of four models of parallel communication 28
5.2.2 Basic explanation of communication parameters 29
5.2.3 Analyzing basic communication operations 31
5.3 The Experimental Environment 34
5.3.1 Description of lab(s) 34
5.4 Estimates of computation parameters ucol, urow and upiv 34
5.5 Estimates of communication parameters s, g and L 36
5.6 The performance model 40
5.7 The optimum number of processors 41
5.8 The optimum number of processors with a new column division scheme 43
5.9 Estimated parallel running time for each communication model 45
5.10 Running time estimates of the revised (MINOS), serial (retroLP) and parallel
Full-tableau algorithms (dpLP) 46
5.11 Advantage of the Steepest Edge column choice rule 55
5.12 Memory requirements of revised and full tableau method 57
5.13 Asymptotic (computation/communication ratio change) analysis 59
5.14 Sensitivity to s, π and p 62
6. Implementation Choices --------------------------------------------------------------------72
6.1 Distributed programming software 72
6.2 Sockets 73
6.3 Reasons for choice of both sockets and MPI 73
6.4 The specific commands used 74

6.5 A brief description of MINOS, a revised simplex implementation 75


7. Computational Experiments---------------------------------------------------------------78
7.1 Problems used for experimentation 78
7.1.1 Synthetic linear programs 78
7.1.2 Netlib Problems 83
7.2 Validation of Performance Models 86
7.2.1 Computation verification 86
7.2.2 Communication verification (using regression for coefficients) 93
7.2.3 Analysis of the revised program (MINOS) expression 98
7.2.4 Revised vs. retroLP and dpLP 101
7.3 Total Time Comparisons 105
8. Summary, Applications and Future Work ------------------------------------------- 110
8.1 Summary 110
8.2 Applications with dense matrices 111
8.3 Future work 112
Appendix A. Form of Linear Program input: LPB and MPS---------------------- 115
A.1 Preprocessing into LPB format 115
A.2 The MPS format 119
Appendix B. Program description: retroLP and dpLP------------------------------ 123
B.1 retroLP - the serial implementation 124
B.2 dpLP – the parallel implementation 125
References ----------------------------------------------------------------------------------------- 128

Figures
Figure 3.1- Full tableau vs. the revised methods......................................................... 17
Figure 3.2-n at which the revised method overtakes the tableau method (m=1,000) . 20
Figure 5.1- time per iteration as n increases................................................................ 45
Figure 5.2 – Time per iteration (m=1,000).................................................................. 49
Figure 5.3-Time per iteration as a function of density................................................ 51
Figure 5.4-Time per iteration as a function of aspect ratio ......................................... 52
Figure 5.5- Time per iteration as a function of m........................................................ 52
Figure 5.6 – Time per iteration as a function of p....................................................... 53
Figure 5.7 – Time per iteration as a function of p....................................................... 54
Figure 5.8 – Memory (in Megabytes) ......................................................................... 59
Figure 5.9 – Asymptotic speedup when a unit computation = 10-7 . s and g move
together................................................................................................................ 61
Figure 5.10 - Asymptotic speedup time when s = 3.4*10-7 and g = 1.7*10-6 .
Computation is changing..................................................................................... 62
Figure 5.11 – Time per iteration as a function of relative error in s ........................... 70
Figure 5.12 - p* as a function of relative error in s ..................................................... 70
Figure 5.13 - Time per iteration as a function of relative error in π ........................... 71
Figure 5.14 – p* as a function of relative error in π.................................................... 71
Figure 7.1 – Total time by generator vs. density......................................................... 81
Figure 7.2 – Time per iteration by generator vs. density ............................................ 82
Figure 7.3 – Iteration time vs. mn (classical column choice rule) .............................. 90
Figure 7.4 - Iteration time vs. mn+αmn (steepest edge column choice rule) ............. 91
Figure 7.5-Actual timing as a function of Density.................................................... 103
Figure 7.6- Actual timing as a function of p ............................................................. 103
Figure 7.7 – retroLP vs. MINOS............................................................................... 107
Figure 7.8 – Speedup relative to MINOS (m=500, n=1,000).................................... 108
Figure 7.9 Speedup relative to MINOS (m=1,000, n=5,000).................................... 109
Figure A.1.................................................................................................................. 120

Tables
Table 5.1-Coefficient estimates................................................................................... 38
Table 5.2 – Coefficient estimates used in models....................................................... 38
Table 5.3- Expressions of the six models.................................................................... 39
Table 5.4 – p* and optimal time per iteration ............................................................. 39
Table 5.5 – Time per iteration for m=1,000 ................................................................ 46
Table 5.6- estimated running time per iteration .......................................................... 48
Table 5.7-p* and T as a function of relative error in s ................................................ 70
Table 5.8- p* and T as a function of relative error in s ............................................... 71
Table 7.1 - Netlib problems sorted by density ............................................................ 85
Table 7.2- percentage errors in both groups of problems. .......................................... 96
Table 7.3-The 24 problems used ................................................................................. 97
Table 7.4- Actual timing as a function of Density .................................................... 104
Table 7.5- Actual timing as a function of p............................................................... 104
Table A.1-Data structure for retroLP and dpLP........................................................ 117

1. Introduction

The simplex algorithm of linear programming has been cited as one of "10

algorithms with the greatest influence on the development and practice of science and

engineering in the 20th century," [Dongarra and Sullivan, 2000]; however, there are

few effective parallel or distributed implementations. By changing from today's

popular form of the simplex method, the "revised" form, to the earlier "standard"

form (full tableau) we have been able to implement an effective coarse grained

distributed algorithm which is a simple extension of the standard form of the simplex

method. We thus reexamine the original version, the full tableau method of the

simplex algorithm. Today, virtually all serious implementations are of the revised

method because it is much faster for sparse LPs, which are most common. However,

the full tableau method has advantages as well. First, the full tableau can be very

effective for dense problems. Second, as we have already mentioned, a full tableau

method can easily and effectively be extended to a coarse grained distributed

algorithm. The distributed computational environment we have in mind is p identical

dedicated workstations connected homogeneously by a broadcast network (Ethernet).

A natural approach to a distributed simplex method is to partition the columns

of the linear program among processors. This has several implications. All activities

performed on columns are divided among the processors and thus sped up essentially linearly. This in turn suggests the use

of the full tableau simplex method in place of the more standard revised simplex

method. In the tableau method no processor needs to keep a copy of the basis or its

inverse. Moreover, the work of updating the columns can be done in parallel. We

wrote a serial linear programming code based on the full tableau. Then, with the

addition of a few simple communication functions, a distributed version was

constructed.

The revised method is faster for sparse problems. For dense problems, though,

the full tableau method may perform better. This is because the revised method must

calculate data that the full tableau would already have. For a comparison of the

computation costs see Section 3.5.

Our method is good for dense problems even when using the serial program.

Our parallel program is good for large problems, sparse or not.

One issue that must be addressed is how to give each processor enough

computation so that the communication latency of a distributed system is

amortized. One way is to pick better columns by using alternate column choice rules.

These alternate rules cost more in computation than the standard rule. On the other

hand, these rules may reduce the number of iterations, see [Wolfe and Cutler, 1963],

[Kuhn and Quandt, 1963], [Goldfarb and Reid, 1977], [Forrest and Goldfarb, 1992]

and [Fletcher, 1998]. By pricing out columns in parallel, the extra cost can be

amortized over the processors.

We first develop performance models for communication and computation.

We then analyze the running time, including computation and communication. We

characterize the optimal number of processors to minimize running time as a function

of size, aspect ratios and density. We also see which column choice rule would be

effective in this method. The dissertation is organized as follows. Section 2 lists

related work both in parallel implementation of the simplex method and in other

methods of solving linear programs. Section 3 gives a review of the simplex method.

Section 4 adds to the motivation of our work. It explains in more detail the types of

problems for which this method is applicable and how, with this method, different

column choice rules can be employed. Section 5 explains in detail the communication

models and their analysis. It shows how well our method does in comparison to other

methods, including the revised simplex method, as represented by MINOS. Section 6

explains how we implemented our serial and parallel method. Section 7 details

experiments on actual problems using our implementation of the method. Section 8

offers conclusions and possibilities for future work. Two appendices follow; one

specifies the MPS input format our linear program packages accept and the other

describes both the serial and parallel linear programming packages.



2. Related Work

2.1 Interior point methods

The simplex algorithm is not the only way to solve a linear program. There

are other methods. The main competitors are a group of methods known as interior

point methods. Some interior point methods have polynomial worst case running

times, which are less than the exponential worst case running time of the simplex

method [Nash and Sofer, 1996 pgs. 269-278]. On average though, the simplex

method is competitive with these methods. With the simplex method, post-run

analysis is also possible [Nash and Sofer, 1996]. This dissertation focuses on the

simplex method.

2.2 Parallel implementations for the simplex method

Parallel implementations for general linear programs have been based on the

revised method [Hall and McKinnon, 1997] and the full tableau method [Eckstein et

al, 1995]. The parallel implementation of the full tableau used "stripe arrays" on the

Connection Machine CM-2. This implementation is fine grain and machine

dependent. The approach we use is coarse grained and can be applied to distributed

systems. Another implementation used a hypercube network [Stunkel and Reed,

1988]. Stunkel and Reed also used the full tableau form of the simplex method. They

actually compared two ways of partitioning the constraint matrix amongst the

processors on a hypercube. The first way is similar to our partitioning scheme where

groups of columns are partitioned and given to different processors. The second way

is to partition the rows.



A few of the differences between our work and the hypercube implementation

are

a. Our method is for Ethernets, which provides broadcast

communication.

b. We provided performance models, which allow us to determine the

optimal number of processors.

c. We considered alternative column choice rules such as the steepest

edge rule.

d. We compared our method to the revised method for dense and sparse

problems.

A fast parallel program that gives an approximate solution for a

special case of linear programs is described in [Luby and Nisan,

1993]. CPLEX has also implemented its simplex code on parallel

machines, based on the dual simplex method [Bixby and Martin, 2000].

3. Review of the Simplex Method

3.1 A short review of linear algebra

We first look at a system of linear equations in two representations:

Ax = b

or

$$\sum_{j=1}^{m+n} a_{i,j}\, x_j = b_i \qquad i = 1,\dots,m$$

We assume, for now, that A is m × (n+m) and of full rank. Thus there is a set

of m linearly independent columns of A, which we call a basis.

By renumbering columns we can collect these columns into a non-singular

square submatrix B and write A = [N|B].

We also write $x = \begin{bmatrix} x_N \\ x_B \end{bmatrix}$.

Then we have $Nx_N + Bx_B = b \Leftrightarrow Ax = b$.

Since B is non-singular:

$B^{-1}Ax = B^{-1}[N|B]x = B^{-1}Nx_N + B^{-1}Bx_B = B^{-1}b$, or

$N'x_N + Ix_B = b'$, or

$Ix_B = b' - N'x_N$, where $N' = B^{-1}N$ and $b' = B^{-1}b$.

In the other notation, $Ix_B = b' - N'x_N$ becomes:

$$x_{n+i} = b'_i - \sum_{j=1}^{n} a'_{ij}\, x_j \qquad (i = 1,2,\dots,m) \qquad (3.1)$$
7

The variables $x_B = [x_{n+1}, x_{n+2}, \dots, x_{n+m}]$ are basic (dependent), and the

variables $x_N = [x_1, x_2, \dots, x_n]$ are non-basic (independent).

For any assignment of non-basic variables, we use (3.1) to determine basic

variables in a solution.

Following Chvátal [1983], we call this a dictionary representation.

If we are given a dictionary (3.1) with assignments of the non-basic variables

satisfying their constraints, so that the resulting values for the basic variables using

(3.1) happen to satisfy their constraints, we say that the dictionary and the non-basic

assignments are feasible.
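To make this construction concrete, the following is a minimal numpy sketch (the data, basis choice and variable names are made up for illustration and are not taken from our codes) that forms $N' = B^{-1}N$ and $b' = B^{-1}b$ and evaluates (3.1):

```python
import numpy as np

# Illustrative data: A is m x (n+m) and the last m columns form the basis.
A = np.array([[1.0, 2.0, 0.0, 1.0, 0.0],
              [3.0, 1.0, 1.0, 0.0, 1.0]])
b = np.array([4.0, 6.0])
basic, nonbasic = [3, 4], [0, 1, 2]

B, N = A[:, basic], A[:, nonbasic]
# N' = B^{-1} N and b' = B^{-1} b; solve() avoids forming B^{-1} explicitly.
N_prime = np.linalg.solve(B, N)
b_prime = np.linalg.solve(B, b)

# (3.1): for any assignment of the non-basic variables x_N,
# the basic variables are x_B = b' - N' x_N.
x_N = np.zeros(len(nonbasic))
x_B = b_prime - N_prime @ x_N
print(x_B)   # here [4. 6.], since x_N = 0 and B is the identity
```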

This section (3.1) explained how a system of equations could be converted

into dictionary form. The following section continues the discussion with the addition

of an objective function.

3.2 Definition of a Linear Program

The Linear Programming terminology used here is similar to that used in

books detailing the full tableau method. One such book is [Chvátal, 1983].

We start with:

$$\begin{aligned}
\underset{x}{\text{Maximize}}\quad & z = cx \\
\text{Subject to}\quad & Ax = b \\
& l_j \le x_j \le u_j, \quad j = 1,\dots,n
\end{aligned}$$

As before we assume that we have a basis, B, to put this in dictionary form:

$$x_{n+i} = b'_i - \sum_{j=1}^{n} a'_{ij}\, x_j \qquad (i = 1,2,\dots,m)$$

We can also use this to eliminate all the basic variables in z = cx.

This leads us to the dictionary representation of Chvátal [1983].



$$\begin{aligned}
\underset{x}{\text{Maximize}}\quad & z = c'_0 + \sum_{j=1}^{n} c'_j x_j \\
\text{Subject to}\quad & x_{n+i} = b'_i - \sum_{j=1}^{n} a'_{ij}\, x_j \quad (i = 1,2,\dots,m) \qquad (3.2) \\
& l_j \le x_j \le u_j, \quad j = 1,\dots,m+n
\end{aligned}$$

Since this is the representation we will use from now on, we drop the primes

from the coefficients. The dictionary is said to be feasible for given values for $x_1,\dots,x_n$

if the given values satisfy their bounds and if the resulting values for $x_{n+1},\dots,x_{n+m}$

satisfy theirs.

Suppose our dictionary, besides being feasible, has the following optimality

properties:

(i) for every non-basic variable $x_j$ that is strictly below its upper bound we

have $c_j \le 0$, and

(ii) for every non-basic $x_j$ that is strictly above its lower bound we have $c_j \ge 0$.

Such a dictionary is said to be optimal. It is easy to see that no change in the

non-basic variables will increase z and hence the current solution is optimal.

In the next example we assume that $B = I \Rightarrow B^{-1} = I$. We take a linear

programming problem and convert it into an initial, possibly infeasible, dictionary. A

procedure called Phase I converts an infeasible dictionary into a feasible dictionary.

Details of Phase I are not discussed here. See [Nash and Sofer, 1996] or [Chvátal, 1983].

Given a problem of the form



$$\begin{aligned}
\text{Maximize}\quad & c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \\
\text{Subject to}\quad & a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n \;\;\mathrm{op}\;\; b_1 \\
& a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n \;\;\mathrm{op}\;\; b_2 \\
& \qquad\vdots \\
& a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n \;\;\mathrm{op}\;\; b_m
\end{aligned}$$

where op refers to any of the relations =, ≤ or ≥, and the variables can be bounded from above and below.

We first add slack, surplus and artificial variables to the constraints. This

results in the following form:

$$\begin{aligned}
\text{Maximize}\quad & c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \\
\text{Subject to}\quad & a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n + s_1 = b_1 \\
& a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n + s_2 = b_2 \\
& \qquad\vdots \\
& a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n + s_m = b_m
\end{aligned}$$

where slacks ≤ 0, surpluses ≥ 0 and artificials = 0.

We then solve for the s variables:

$$\begin{aligned}
\text{Maximize}\quad & z = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \\
\text{Subject to}\quad & s_1 = b_1 - a_{11} x_1 - a_{12} x_2 - \cdots - a_{1n} x_n \\
& s_2 = b_2 - a_{21} x_1 - a_{22} x_2 - \cdots - a_{2n} x_n \\
& \qquad\vdots \\
& s_m = b_m - a_{m1} x_1 - a_{m2} x_2 - \cdots - a_{mn} x_n
\end{aligned}$$

where slacks ≤ 0, surpluses ≥ 0 and artificials = 0.

which gives us a “Dictionary” [Chvátal, 1983].

The "a" values can be put into a tableau (matrix). Appendix A gives an

example of this transformation.
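As a small illustration of this transformation (made-up data; the sign conventions for slack, surplus and artificial variables are ignored here), one s variable is added per row and the system is solved for the s variables:

```python
import numpy as np

# Constraints 2x1 + x2 op 8 and x1 + 3x2 op 9, objective 3x1 + 2x2.
c = np.array([3.0, 2.0])
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([8.0, 9.0])

# After adding one s variable per row (a_i x + s_i = b_i) and solving
# for the s variables, each dictionary row reads s_i = b_i - sum_j a_ij x_j.
x = np.zeros(2)
s = b - A @ x
print(s)   # [8. 9.] -- with all x_j = 0 the s variables take the value b
```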

This and the previous section (3.1 and 3.2) explained how to put a linear

program into dictionary form. No pivots were necessary. The following section

explains the simplex method optimization process, which employs pivoting.



3.3 Short description of the full tableau simplex method

The simplex method operates by performing iterations on a feasible

dictionary. Each pivot "increases" the objective until the solution is shown to be

optimal, or the problem is shown to be unbounded or infeasible. An earlier implementation of the full

tableau simplex method is provided in [Press, 1992].

In order to get an initial feasible solution, another linear programming

problem is solved. This linear program uses the same procedure as explained here. It

uses a different objective on the same dictionary. This first linear program is known

as Phase I. The linear program we need to solve, which uses the real objective, is

known as Phase II. More details on Phase I can be found in linear programming

books, for example [Chvátal, 1983].

Within each iteration there are three steps:

1) Choose a non-basic variable (column choice):

Select a non-basic variable $x_j$, with its cost coefficient $c_j > 0$, in equation

(3.2) that is not at its upper bound, or one with its cost coefficient $c_j < 0$ and not

at its lower bound. There may be many such eligible choices. For now, any

will work. See the discussion below for possible alternatives. If no eligible

variable exists (the largest absolute value is 0), the solution is optimal and we exit.

2) Determine which variable becomes non-basic (row choice):

As we modify the non-basic variable in the direction determined in step 1)

three possibilities exist:

i) A bound of the non-basic variable of the winning column is the first to be

violated. Let the variable go to this other bound.



ii) The basic variable in row i is the first to violate its bound. Pivot using the

violated row.

iii) No constraint is violated. The problem is unbounded.

3) Perform a pivot or move a non-basic variable from one bound to its other

bound. (Steps 2 and 3 are sketched in code below.)
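The following is a condensed Python sketch of steps 2 and 3, for the simplified case in which every variable has lower bound 0 and no finite upper bound (so case i of step 2 cannot occur); it is illustrative only and is not the retroLP code:

```python
import numpy as np

def ratio_test_and_pivot(T, col):
    """Steps 2 and 3 on a full tableau T (simplified sketch).

    Row 0 of T is the objective row and the last column is the
    right-hand side; all variables are assumed to have lower bound 0
    and no finite upper bound.
    """
    a = T[1:, col]                       # entering column, constraint rows
    rhs = T[1:, -1]
    pos = a > 1e-12
    if not pos.any():
        raise ValueError("problem is unbounded")     # case iii
    ratios = np.full(a.shape, np.inf)
    ratios[pos] = rhs[pos] / a[pos]
    row = 1 + int(np.argmin(ratios))     # case ii: first bound violated
    # Step 3: pivot on T[row, col] over the whole (m+1) x (n+1) tableau.
    T[row, :] /= T[row, col]
    for i in range(T.shape[0]):
        if i != row:
            T[i, :] -= T[i, col] * T[row, :]
    return row
```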

The simplex just described uses the classical (Dantzig) column choice rule in

step 1. This is the most commonly used column choice rule. Other column choice

rules are possible.

[Wolfe and Cutler, 1963] and [Kuhn and Quandt, 1963] were early studies of

column choice rules. These studies evaluated how different column choice rules

affect running time. The issue was the tradeoff between using relatively complex

(slow) choice rules to reduce the number of iterations and the resulting increased

running time per iteration. In addition to the classic rule introduced by Dantzig the

“greatest change rule" and the "steepest edge rule" were among those tested. The

results were that while the more complex column choice rules would decrease the

number of iterations, the cost of applying those rules was too great for the problems

they studied. The extra computation required took away any speed-up gained by the

reduction in pivot iterations. They performed the tests using the full tableau method.

Today the revised method is more common.

Harris [1973], Goldfarb and Reid [1977], Forrest and Goldfarb [1992] and

Fletcher [1998] studied how to implement the Steepest Edge rule efficiently in the

revised method. They stored extra information in order to keep a recurrence formula

updated. They report an overall gain in computation speed when using the steepest

edge rule instead of the classic rule. Another column choice rule that is only feasible

to implement in the full tableau method is the “Greatest Change Rule." These three

general classes of column choice rules were implemented in our codes and are

explained in more detail below.

Classical (Dantzig): Take the eligible column with maximum $|c'_j|$; its

complexity is basically a constant number of additions/subtractions per

column, or order n for the entire process. In the full tableau method the current

objective row is readily available.

Steepest Edge: For each eligible column, divide $|c'_j|$ by the length of the

column of A, $\sqrt{\sum_i a'^{\,2}_{i,j}}$, and take the largest of these quotients. For increased

efficiency, one actually considers $c'^{\,2}_j \big/ \sum_i a'^{\,2}_{i,j}$. The complexity here is m+1

multiplication/divisions, and a similar number of additions/subtractions, worst

case, per column evaluated. It is important to note that this calculation is only

necessary for "eligible" non-basic variables, candidates that would increase the

objective value if brought into the basis. This is discussed further in Section

7.2.1.

Greatest Change: For each eligible column, actually compute how much the

objective function would improve if the column were introduced, and select

the one which would cause the greatest change. The complexity here is

essentially that of the row choice rule, order m multiplications/divisions and



additions/subtractions per column evaluated. Again, the full calculation is

only needed for eligible columns.
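As an illustration, the three scores can be computed for a single eligible column j as follows (a simplified sketch: c_row is the current objective row, T the current constraint rows, rhs the current right-hand side, and lower bounds of 0 are assumed for the greatest change rule):

```python
import numpy as np

def dantzig_score(c_row, j):
    # Classical rule: |c'_j|, read directly from the objective row.
    return abs(c_row[j])

def steepest_edge_score(c_row, T, j):
    # c'_j^2 / sum_i a'_{ij}^2  (squared form, avoiding the square root).
    col = T[:, j]
    return c_row[j] ** 2 / (col @ col)

def greatest_change_score(c_row, T, rhs, j):
    # Objective improvement if column j enters: |c'_j| times the step
    # length determined by the ratio test (lower bounds of 0 assumed).
    col = T[:, j]
    pos = col > 1e-12
    if not pos.any():
        return np.inf                    # unbounded direction
    step = (rhs[pos] / col[pos]).min()
    return abs(c_row[j]) * step
```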

The greater amount of work for the steepest edge and greatest change rules is

more than made up for by the reduction in the number of iterations using the

standard method. However, the greatest change rule is very hard to use in the

revised method. The steepest edge is moderately difficult to implement in the revised

method; special techniques are required [Goldfarb and Reid, 1977], [Forrest and

Goldfarb, 1992].

The steepest edge or greatest change rule is definitely worthwhile in our

distributed implementation (see Section 4.1.2).

The full tableau (matrix) method, unlike the revised simplex method, stores

and pivots on the whole tableau (matrix) of m rows and n columns. The full tableau

method uses information from the top row, which is the objective vector, for the

standard column choice rule. If other column choice rules are used additional

information is needed from the rest of the tableau.

3.4 Short description of the revised method

The Revised Simplex Method is commonly used for solving linear programs.

This method operates on a data structure that is roughly of size m by m instead of the

whole tableau. This is a computational gain over the full tableau method, especially in

sparse systems (where the matrix has many zero entries) and/or in problems with

many more columns than rows. On the other hand, the revised method requires extra

computation to generate necessary elements of the tableau: the current cost



coefficients and the entering pivot column for the column choice and the row choice

respectively. These are computational costs the full tableau method doesn't have.

In the full tableau method we started with:

$$\begin{aligned}
\underset{x}{\text{Maximize}}\quad & z = cx \\
\text{Subject to}\quad & Ax = b \\
& l_j \le x_j \le u_j, \quad j = 1,\dots,m+n
\end{aligned}$$

then using a basis B, and its inverse, we obtained a dictionary form:

$$\begin{aligned}
\underset{x}{\text{Maximize}}\quad & z = c'_0 + \sum_{j=1}^{n} c'_j x_j \\
\text{Subject to}\quad & x_{n+i} = b'_i - \sum_{j=1}^{n} a'_{ij}\, x_j \quad (i = 1,2,\dots,m) \\
& l_j \le x_j \le u_j, \quad j = 1,\dots,m+n
\end{aligned}$$

In the revised method, the second form is represented implicitly in terms of

the original system together with a functional equivalent of the inverse of the basis B

rather than explicitly in the dictionary form. "Functional equivalent" means we have a

data structure which makes solving $\pi B = c_B$ for $\pi$ and $BA'_j = A_j$ for $A'_j$ easy. $A_j$

represents the jth column of the A matrix and $c_B$ represents the basic objective

coefficients. The data structure need not be $B^{-1}$ or even necessarily a representation of

it. For example, an LU decomposition of the basis is often used, see [Nash and Sofer,

1996 pgs. 218-222]; another is to represent the inverse as a product of simple pivot

matrices; see [Nash and Sofer, 1996] and [Chvátal, 1983 Ch. 7]. Given the implicit

representation, we recreate the data needed to implement the three parts of the

simplex iteration. Thus, to "pivot" we update b' and our functional equivalent of

$B^{-1}$. We now sketch the steps of the revised simplex method.

Select Column:

We must have the coefficients c'j available. We use multiples of the

original constraints to eliminate the basic variables in the expression for z.

Symbolically, we let the component row vector π represent the multiples; that

is, we multiply constraint i by πi and subtract the result from the expression

for z. To make this work π must have πB = cB where cB, as above, represents

the m elements of c corresponding to the basis columns. Then c'j in the

dictionary is given by c' j = c j − π A j where Aj represents the jth column of

A.

This computation is called pricing. So to select the column using the

classical Dantzig rule, the vector π must be calculated and then the inner

product of π with each column of A must be subtracted from the original

coefficient.

The revised method takes more effort than the standard simplex

method in this step. However, for sparse matrices, pricing out is speeded up

because many of the products have zero factors. Moreover, the revised

simplex method can be speeded up considerably by using partial pricing

[Nash and Sofer, 1996 p. 222]. Partial pricing is a heuristic of not considering

all the columns during the column choice step. On the other hand, alternate

column choice rules such as steepest edge (however see [Forrest and

Goldfarb, 1992]) and greatest change are much more difficult to implement

using the revised approach.

Select Row:

To implement this we need b' and the column from the dictionary that

we chose in Step 1, $A'_j = (a'_{1j}, a'_{2j}, \dots, a'_{mj})^T$. The b' vector will be updated

from iteration to iteration; it does not need to be recreated. $A'_j$ is given by

solving $BA'_j = A_j$.

Here, sparsity in A pays off again. In the standard simplex method, as

we go from dictionary to dictionary we quickly lose whatever sparsity there

was in the original matrix A. In the revised method since we always go back

to the original matrix, we still have the original sparsity. Specifically, we have

the original sparsity in Aj.

Pivot:

If we need to pivot, instead of explicitly pivoting in a dictionary as before,

we update our functional equivalent of $B^{-1}$.
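The following sketch puts the three steps together for the explicit-inverse case. It is illustrative only: bounds and degeneracy are ignored, and production codes such as MINOS keep an LU factorization or product form rather than $B^{-1}$ itself.

```python
import numpy as np

def revised_iteration(A, b, c, Binv, basic, nonbasic):
    """One revised-simplex iteration with an explicit inverse (sketch)."""
    pi = c[basic] @ Binv                      # solve pi B = c_B
    # Pricing: reduced costs c'_j = c_j - pi A_j for non-basic columns.
    reduced = c[nonbasic] - pi @ A[:, nonbasic]
    k = int(np.argmax(reduced))               # classical (Dantzig) rule
    if reduced[k] <= 1e-9:
        return None                           # optimal
    j = nonbasic[k]
    col = Binv @ A[:, j]                      # recreate entering column A'_j
    b_cur = Binv @ b
    pos = col > 1e-12
    if not pos.any():
        raise ValueError("unbounded")
    ratios = np.where(pos, b_cur / np.where(pos, col, 1.0), np.inf)
    r = int(np.argmin(ratios))                # leaving row
    # "Pivot": update the functional equivalent of B^{-1} by an eta matrix.
    E = np.eye(len(b))
    E[:, r] = -col / col[r]
    E[r, r] = 1.0 / col[r]
    Binv = E @ Binv
    basic[r], nonbasic[k] = j, basic[r]
    return Binv, basic, nonbasic
```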

3.5 Running time comparison of the revised method and the full tableau

method

The revised simplex method is especially efficient for linear programs that are

sparse and have high aspect ratio (n/m). A linear program is sparse if most of the

elements of the dictionary are 0, and it has a high aspect ratio if n/m is large.

Updating any of the representations used by the revised method is usually, at worst,

of order m². On the other hand, pivoting on the explicit representation of the

dictionary takes order mn. Thus for high aspect ratios the standard method takes much

more work. Fortunately, in our distributed method, this work is done in parallel with

linear speedup in a straightforward way. This will be made clearer as we derive

performance models for the revised and full tableau methods later in this section.

Figure 3.1 compares advantages and disadvantages of the full tableau vs. the

revised methods.

Revised Simplex Method                           | Standard Simplex Method
-------------------------------------------------|-------------------------------------------------
Takes better advantage of sparsity in problems   | Is more effective for dense problems
Is more efficient for problems with large        | Is more efficient for problems with low
aspect ratio (n/m)                               | aspect ratio
Can effectively use partial pricing              | Can easily use steepest edge or greatest change
                                                 | pricing in addition to the classic choice rule
Is difficult to perform efficiently in parallel, | Very easy to convert to a distributed version
especially in loosely coupled systems            | using a loosely coupled system
Frequently, the functional equivalent of the     | Rarely, the dictionary has to be recomputed
basis inverse is recomputed, both for numerical  | from the original data to maintain numerical
stability and for efficiency (e.g., maintaining  | stability (but not for efficiency). The work
sparsity). The work is modest.                   | is substantial.
Figure 3.1- Full tableau vs. the revised methods

We will now give expressions to estimate the time per iteration of the revised

and full tableau methods. For simplicity we will count only multiplication and

division operations. The full tableau method using the classical column choice rule

requires m operations for the ratio test and (m+1)(n+1) for the pivot. Column Choice

requires n comparisons (insignificant cost) for the classical column choice rule and

(m+1)n (multiply/divides) for the steepest edge rule. Initially, we consider the

classical column choice rule and will therefore ignore column choice cost. This totals

mn+2m+n+1 operations per pivot iteration.

We can safely choose the iteration time estimate based on the explicit inverse

as an upper bound on the true value, since the more exotic functional equivalents of

the basis inverse would not be used if the performance were not better.

The revised method using the explicit inverse requires m² to determine π, mn

floating point multiplication/division and addition/subtraction operations to price out

the current objective row, m² to compute the entering column, m for the ratio test and

(m+1)² for the pivot. This totals mn+3m²+3m+1. This assumes an explicit inverse;

this can be reduced significantly for sparse problems or if we use a more effective

functional equivalent of the basis inverse such as LU decomposition.

Assuming a dense matrix, the time per iteration (measured by the number of

multiplications and divisions) is the same for the revised and full tableau methods when

n=3m²+m. The last line of Table 5.6 in Section 5.7 shows an example. If n<3m²+m

the full tableau method should be faster. If n>3m²+m the revised method should be

faster. That means that for dense problems the full tableau method requires less

computation than the revised method for n<3m²+m. This analysis is for the revised

using the explicit inverse representation. For other representations the analysis

doesn’t apply on a per iteration basis.

As just explained, the revised method pivots on a portion of what the tableau

method pivots on; m columns instead of n columns. On the other hand, it must

calculate the current pivot column and objective row from the original problem. If the

original problem has zeros the revised need not multiply those zero elements. One

cannot count on the pivot itself being shortened since the current pivot column is not

part of the original (sparse) data. Similarly the full tableau method can't take

advantage of this sparsity since it pivots on derived data instead of the original data.

Since the revised method does not do operations on zero elements, there are

potential savings for sparse problems. Let d (for density) be the average fraction of

elements that are nonzero. For example, if we assume that on average each column has 5% of

its values nonzero, then d=5%. Determining π, pricing out and calculating the

entering column will take about dm², dmn and dm², respectively. The ratio test and pivot

still require roughly the same number of operations as before: m and (m+1)²,

respectively. The total running time of a sparse problem on the revised method using

the explicit inverse is approximately 2dm² + m² + dmn + 3m + 1. Assuming an

extreme of complete sparseness, where d is 0, the running time is m² + 3m + 1. It

equals the running time of the full tableau method when n=m. That means that in

complete sparseness the revised is faster as soon as n gets larger than m, which is the

usual case. A similar discussion to this one can be found in Nash and Sofer [Nash and

Sofer, 1996 p. 115].

For example, assume m=1,000 and there is 5% density (d). The running time

of the revised method can be estimated as

50n + 100,000 + 1,000,000 + 3,000 + 1 = 50n + 1,103,001. The running time of the

tableau method is 1,000n + 2,000 + n + 1 = 1,001n + 2,001. The running times of the

revised and full tableau methods are equal when

$$n = \frac{m(m(2d+1)+1)}{m(1-d)+1}.$$

When n=1,158

the revised and tableau methods take about the same time. When n<1,158 the tableau

method is faster. Once n>1,158 the revised method is faster. Figure 3.2 is a graph that

shows for m=1,000 at what n the revised method overtakes the tableau method. This

is shown for varying density.
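The crossover values plotted in Figure 3.2 follow directly from the operation counts above; a short sketch:

```python
# Per-iteration multiplication/division counts from the text, and the
# crossover n at which the two methods are equal (explicit inverse,
# classical column choice rule).
def full_tableau_ops(m, n):
    return m * n + 2 * m + n + 1

def revised_ops(m, n, d):
    return 2 * d * m * m + m * m + d * m * n + 3 * m + 1

def crossover_n(m, d):
    return m * (m * (2 * d + 1) + 1) / (m * (1 - d) + 1)

m = 1000
for d in [0.0, 0.05, 0.1, 0.25, 0.5]:
    print(f"d = {d:4.2f}: revised overtakes tableau at n = {crossover_n(m, d):5.0f}")
# d = 0.00 -> n = 1000;  d = 0.05 -> n = 1158;  d = 0.50 -> n = 3994
```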



[Figure 3.2 plots, for m=1,000, the crossover n (vertical axis) against density d from 0 to 0.5 (horizontal axis); the plotted values rise from n=1,000 at d=0 through n=1,158 at d=0.05 up to n=3,994 at d=0.5.]

Figure 3.2-n at which the revised method overtakes the tableau method (m=1,000)

4. Motivation for a serial and distributed full tableau implementation

4.1 Why the method is applied to full tableau

Aside from the two early papers mentioned, recent research has focused on

the revised method. We use the full tableau method. There are a number of reasons

for this. The two main reasons are that it is better for dense problems (Section 4.1.4)

and that a coarse grained distributed program is straightforward (Section 4.1.3). More

specifically:

4.1.1 No pricing out or column reconstruction

As stated above the revised method requires extra computation to calculate

(price out) the objective row. It also requires extra computation to calculate the

column entering into the basis. Using the full tableau method avoids this computation.

4.1.2 Alternate column choice rules can easily be used

Another advantage of using the full tableau method is the possibility of using

multiple column choice rules. These were briefly mentioned in Section 2.2 and in

Section 3.3. The classical (Dantzig) rule simply selects the column with the maximum

coefficient in the objective row of the updated tableau. It is easy to use this rule. The

cost of deciding which column will enter the basis, using the Dantzig rule, is only n

comparisons. The revised method as mentioned in previous sections must determine

the multipliers and price out; a cost of m²+mn.

Other column choice rules, on the other hand, require the values of the

complete nonbasic column in addition to the value of the objective coefficient.

The full tableau method allows the use of these other column choice rules

without the extra computational cost of recreating those columns. Its only additional

cost is m multiplications per column. This computation is amortized over the different

processors in dpLP.

The revised, on the other hand, does not keep the updated column on hand.

(Note the m² cost of computing the entering column.) To use these other rules the

revised must recompute every column (not only the entering column) from the

inverse, instead of just pricing out the objective row. In addition to the mn cost of

pricing out it would now cost m²n to reconstruct all n columns. Goldfarb and Reid

[1977] and Forrest and Goldfarb [1992] addressed this problem for the steepest edge rule. They used a

recurrence formula to update the steepest edge direction (norm). There is a substantial

cost to initializing this formula. In addition, each iteration takes longer due to the cost

of updating this recurrence formula. They have reported that using this Steepest Edge

rule generally cuts down the total computation time for any problem size; that is, it

reduces the number of iterations enough to more than compensate for the added per-

iteration cost. A few of the rules that can be used with the full tableau method

include the greatest change method and different gradient methods. Wolfe and Cutler

[1963] and Kuhn and Quandt [1963] studied these column choice rules using the full

tableau method. They each took rather small linear programming problems and ran

them using various column choice rules. They calculated the time per iteration and

the number of iterations for each run. From these runs they concluded that on average

the greatest change method and different forms of the steepest edge method result in

fewer iterations. The extra cost per iteration, though, was too costly and made the

algorithm as a whole slower when used with these alternate rules. We did similar tests

on our serial implementation on large problems and did find a gain in computation

time for alternate rules. We compare the standard rule to the steepest edge rule in

Sections 5 and 7.

4.1.3 The inverse does not have to be kept in each processor

Our distributed method divides the columns amongst the processors. Using

the revised method requires part of the tableau to be recreated from the inverse each

iteration, which makes a parallel method difficult. At least one processor would need

a copy of the whole m-by-m inverse or its functional equivalent. The full tableau

method does not need to recreate any row or column and can simply hold as many

columns as it wants without the extra inverse overhead. Using the full tableau method

makes it unnecessary for any processor to carry the inverse of the basis.

4.1.4 Dense problems do not gain by use of the revised

Our method works for all problems, but it performs best when used on dense

problems. This is because the revised method is slower than the full tableau method

for dense problems even on one processor, assuming n is not too much greater than

m, as was noted in Section 3.5. For dense problems there is no point in using the

revised method. Applications that give rise to dense problems are given in Section

8.2.

4.2 Distributed computing

Distributed computing is the use of multiple computers communicating over a

loosely connected network. They do not share memory but communicate via message

passing [Maekawa, 1987].

4.2.1 Why use distributed over MPP – no need for a supercomputer

A distributed system has a number of advantages over a massively parallel

machine. It can be composed of readily available computers. A distributed program



can be run on a distributed system of dedicated computers. This would mimic a

supercomputer. It can also be run on any network of workstations that is not using all

its CPU resources. A special supercomputer is unnecessary.

4.2.2 Why coarse grain parallelism is used

When using a distributed network, special attention must be paid to the

tradeoff between communication time and the time spent by the processors doing

useful work. The physical communication among processors in a distributed system is

slower than among processors in a supercomputer. Thus, while relatively fine-grained

parallelism can be used with supercomputers, in a distributed system only coarse-

grained parallelism can be used effectively. The ratio of communication time to

computation time in message passing systems is much higher than that for

supercomputers. This has significant implications for our work.



5. Models and Analysis

5.1 Synchronized parallel pivots

Our general approach is to divide the n columns of the problem amongst p

processors. There is no overlap of columns. All vectors that hold information for the

non-basic variables are also divided. The x vector, which holds the values of the non-

basic variables, is therefore divided. The b vector, which holds the values of the basic

variables, is completely copied to all the processors. At each iteration, every

processor calculates the best candidate for the pivot column from among its columns,

using one of the column choice rules. For each column choice rule that we have

discussed, a numerical value is assigned to each column, and the column

with the maximum value is chosen. In parallel, each processor looks at its

columns. The best of these values determines the local column chosen. This column

is that processor’s proposal. A coordinating processor then calculates the best value

from among the proposals. In our (first) implementation, only one column is proposed

per processor although generalizations to multiple proposals are easy to design (see

Section 8.3). The processor with the winning proposal sends its column to all

processors who then pivot on their columns (explained below). This pivot is the last

step of the iteration. The iteration is repeated until optimality, infeasibility or

unboundedness is detected. Program details are left to the appendices.



Communications required for parallel programming


Our parallel programming algorithm requires the following communication

among the processors.

1. Initialization

At the beginning of processing before any iteration, each participating

processor must be sent the initial value of the basic variables, b, and a

subset of the columns of the problem (approximately n/p columns

each).

2. Column Proposal

Every iteration, each processor must make known the value of its best

column according to the column choice rule being used.

3. Pivot Column Broadcast

Every iteration, the best candidate among the proposals of the

processors is selected; the complete winning column must then be sent

to each processor.

4. Finalization

At the end of the processing, after all iterations, the information

associated with the non-basic variables at each processor must be

consolidated to provide the final solution.

Since Steps 1 and 4 are executed only once, we therefore give them less

consideration than Steps 2 and 3, which must be performed each iteration.

There are two implementations of Steps 2 and 3, which we refer to as

centralized and decentralized respectively.



In either approach, Steps 1 and 4 are both coordinated by a distinguished

processor. In the centralized approach this processor also coordinates Steps 2 and 3.

That is, each processor sends the value of its column proposal to the coordinator. The

coordinator selects the winner, and tells the winning processor to send its column to

all the other processors. In the decentralized approach, each processor broadcasts its

proposal to all the other processors. Simultaneously, all processors can determine the

winner; specifically the winning processor is able to determine that it is the winner. It

can then broadcast its column to all the processors. No coordination is required.

A broadcast communication medium such as an Ethernet or token ring local

area network can facilitate this processing. For example, the algorithm requires only

one broadcast transmission of order m doubles to accomplish step 3. (Getting

application layer communication facilities to broadcast in order m time is not easy;

see Section 6.2.). Step 2 using the centralized approach requires p point-to-point

transmissions of essentially one double to the coordinator. The coordinator then sends

a point-to-point transmission to the winning processor. In the decentralized approach,

this is replaced by p broadcasts of a double, one from each processor.
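The per-iteration communication of Steps 2 and 3 maps directly onto the Allreduce and Bcast primitives analyzed in Section 5.2.3. The following is a minimal mpi4py sketch of the decentralized scheme; it is illustrative only, since our actual implementation uses datagram sockets (see Section 6) in order to reach the Ethernet broadcast facility, which available MPI implementations do not use.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def propose_and_broadcast(my_cols, scores):
    """Steps 2 and 3 of one iteration, decentralized (sketch).

    my_cols holds this processor's ~n/p columns; scores holds the
    column-choice-rule value of each local column (larger is better).
    """
    # Step 2 (column proposal): every processor announces its local best.
    local_best = int(np.argmax(scores))
    proposals = comm.allgather((float(scores[local_best]), rank))
    value, winner = max(proposals)     # all processors agree on the winner
    if value <= 0.0:
        return None                    # no eligible column: optimal
    # Step 3 (pivot column broadcast): winner ships its column to all.
    col = my_cols[:, local_best].copy() if rank == winner else None
    return comm.bcast(col, root=winner)
```

The allgather of a single (value, rank) pair plays the role of the Allreduce, and the bcast is the single order-m broadcast of step 3.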

5.2 Short sketch of one distributed pivot step

A more complete program description is provided in the appendices.

Assume we are in Phase II:

a) Each processor selects and sends its local best column to the main

coordinating processor. The main processor calculates the global max. If no

column can be selected we are at the optimum, and Phase II is over.



b) The processor with the global max selects the variable (row) to leave the basis

and ships a copy of its winning column together with its row choice to every

processor.

c) Next, there are 3 possibilities, as in every simplex algorithm:

i) A bound of the non-basic variable corresponding to the winning column is

the first to be violated. Let the variable go to this bound.

ii) Row i's constraint is the first to be violated. We then pivot using the

violated row.

iii) No constraint is violated. The problem is unbounded.

5.2.1 Choice of four models of parallel communication

Several models of parallel computation were investigated. One is the LogP

model proposed by Culler et al [1993]. Another is the BSP model proposed by

Valiant [1990]. In addition to the previous two we looked at simple models assuming

the use of local area networks (LAN’s) such as Ethernets or Token Rings for

communication. Most LAN’s are intrinsically broadcast devices, however not all

software for using them in distributed computing takes advantage of that capability.

We considered LAN communication both with and without broadcast primitives. The

LogP model is a model for asynchronous computation whereas the BSP is a model for

synchronous computation. In choosing which model to use, we had to look at our

algorithm to see how synchronous it is. We also had to determine the size of the

computational chunks performed independently by the individual processors. The

matter of whether an algorithm uses coarse or fine grain parallelism influences the

choice of a communication model. All the models were used in the analysis, although

we modified the LogP model slightly. In our testing, detailed in Section 7, only the

Ethernet with broadcast is used. The following paragraphs explain this in detail.

As just mentioned, four parallel models were used to analyze the program.

These will be listed later in this section in Table 5.3. The first is a model of Ethernets

using a broadcast primitive. The second is a model of Ethernets not using a broadcast

primitive. These were chosen due to the common use of these topologies. The other

two models assume a completely connected topology. They are the BSP model and

another commonly used (CU) model. This commonly used model will be used instead

of the LogP model for two reasons. The LogP model is complex and it overestimates

the running times for programs that send large messages. Our program broadcasts

vectors which can be quite large and which can cause LogP to give bad estimates.

Whereas the CU model is asynchronous like LogP, BSP assumes synchronous

supersteps. Our algorithm has a few supersteps which makes BSP an interesting

model to use. By using both these models, we can see how long it should take both on

a synchronized system and on an asynchronous system.

5.2.2 Basic explanation of communication parameters

In the discussion below we are sending m double precision floating point

numbers across a network that has p processors. For illustration, we assume m to be

1,000 and p to be 100. The discussion would also apply to all m and p.

For an Ethernet we assume a simple two-parameter model of a LAN. If 1000

items are to be sent across the network within one message it takes s+1000g time. s

represents the startup time (latency) and g represents the time per item (the inverse of the throughput).

two communications can take place simultaneously. If the Ethernet supports a

broadcast primitive, sending the same 1000 items to p processors takes s+1000g; if

broadcasts are not supported then (99)(s+1000g) time is required. We ignore the

effects of collisions and retransmissions, and assume the Ethernet is lightly loaded.

The LogP model has three parameters: l, o and h. The definition of these

parameters is subtler than for those above. The parameter l is the time it takes for an

item to go from processor to processor over the network. Typically this is extremely

short. o is the operating system time taken by a processor to send or receive an item.

To send one item would cost l+2o. h is the time the processor must wait before

sending the next item. In this model, processor A can begin sending to another

processor C after sending to processor B before B has completely received its data.

Whether to send 10 items from one processor to another or to send one item from one

processor to 10 other processors (broadcast) takes l+2o+10h. Clearly, sending 10

items as a group will be faster on most architectures, which is why LogP will

overestimate the time for long transmissions.

The commonly used model (CU) has 2 parameters: s and g. These are the

same as in the Ethernet model. The difference is that different groups of processors

can communicate to each other simultaneously, whereas in the Ethernet only one

processor can successfully communicate at a time. Unlike the LogP model, in the CU

model a processor can't begin a session until its previous session is finished. To send

1000 items takes s+1000g.

The BSP model has 2 parameters: L and g. g is the same g as in the last model.

L is different from s. In the BSP, communication is synchronized. For each

synchronized period, called a superstep, one takes the maximum number of items any

one processor communicated (sent or received), multiplies it by g, and adds L. For



processor 1 to send 10 items to each of 100 processors in one superstep takes

L+1000g, since processor 1 sent a total of 100*10 items.
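To make these conventions concrete, the following sketch evaluates the running example (1,000 items, p = 100) under each model; the parameter values are placeholders, not the measured estimates of Section 5.5:

```python
# Cost of the running example under each model's conventions.
# s, g, L, l, o, h are placeholder values for illustration only.
m_items, p = 1000, 100
s, g, L = 1e-3, 1e-6, 2e-3           # Ethernet/CU/BSP-style parameters
l, o, h = 1e-5, 1e-4, 1e-6           # LogP-style parameters

ethernet_bcast    = s + m_items * g               # broadcast primitive used
ethernet_no_bcast = (p - 1) * (s + m_items * g)   # p-1 point-to-point sends
cu_session        = s + m_items * g               # one session between a pair
logp_like         = l + 2 * o + m_items * h       # per the LogP discussion
bsp_superstep     = L + m_items * g               # root sends 10 items to each
print(ethernet_bcast, ethernet_no_bcast, cu_session, logp_like, bsp_superstep)
```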

5.2.3 Analyzing basic communication operations

Our program uses Datagram sockets to take advantage of the Ethernet’s

broadcast facility. Our program has two basic communication segments that

correspond nicely to two basic parallel communication primitives, Allreduce and

Bcast, from MPI [Snir, 1996]. MPI is a parallel communications package that allows

many processors to communicate with one another. Available implementations,

unfortunately, do not take advantage of the Ethernet broadcast facility. This

motivated us to do our own parallel implementation using socket programming.

Section 6 describes MPI as well as the original reason for its choice as the parallel

package before sockets were used. For ease of reference, the two MPI primitive

names will be used for what we implement using sockets. Below is a brief analysis of

the Bcast and Allreduce MPI primitives. For a more in-depth analysis see Karp et al

[1993], who discuss the complexity of these operations.

Allreduce gathers one element from each processor to a "root" processor. (This first step is called Reduce.) The root calculates the maximum of these values and broadcasts the maximum to all the other processors. For a completely connected network topology, Reduce has been shown to take the same time as Bcast, since it involves messages between the same endpoints in the opposite direction [Karp et al, 1993]. Allreduce takes at most 2*Bcast time. This bound is easily achieved by executing Reduce followed by Bcast. Karp et al [1993] show that, optimally, Allreduce takes no longer than Reduce.
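In MPI terms the two-step bound corresponds directly to calling Reduce and then Bcast. A minimal sketch using standard MPI calls, for a maximum over one double per processor:

#include <mpi.h>

/* Allreduce realized as Reduce followed by Bcast -- the 2*Bcast bound. */
void allreduce_max_two_step(double *local, double *global, MPI_Comm comm) {
    MPI_Reduce(local, global, 1, MPI_DOUBLE, MPI_MAX, 0, comm);  /* root 0 computes the max */
    MPI_Bcast(global, 1, MPI_DOUBLE, 0, comm);                   /* root 0 sends it back out */
}

/* The library's one-call equivalent. */
void allreduce_max_direct(double *local, double *global, MPI_Comm comm) {
    MPI_Allreduce(local, global, 1, MPI_DOUBLE, MPI_MAX, comm);
}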



On an Ethernet, such as our network, Reduce is slower than Bcast because all processors can listen at once but only one can transmit. Although we assume in our expressions that Reduce takes O(p) time for p processors, note that Martel [1994] found O(log p) to be the optimal time for Reduce on an Ethernet.

The Bcast command broadcasts a vector of m elements to all other processors.

Depending on the algorithm used to implement it, it can take different amounts of

time.

Timing for Ethernet with and without using Broadcast

For an Ethernet, Bcast takes s+mg or (p-1)(s+mg) depending on whether the

Ethernet broadcast primitive is being taken advantage of or not. Allreduce takes (p-

1)(s+g) or 2(p-1)(s+g) respectively.

Timing for the BSP model

For the BSP the Allreduce takes L+2(p-1)g. (This analysis assumes that the

Reduce and Bcast that are implemented within the Allreduce are done sequentially

with no overlap. This, in fact, is optimal.) This communication block is called a

“superstep.” We show how to Bcast m items both using one superstep and using two

supersteps. Assuming L is a large cost, we want to keep the supersteps to a minimum,

which is why we won't use a logarithmic tree algorithm such as in the CU model (see

the next paragraph). In the first algorithm the root sends the m items to the p-1

processors. This takes L+(p-1)mg time. In the second algorithm, the root splits the m

items into p parts of size m/p. It then sends a different part to each of the p-1

m
processors in the first superstep. This takes L + ( p − 1) time. In the second
p

superstep, each processor sends its part to the other p-2 processors (excluding itself

and the root). The root sends its portion to p-1 processors and receives the same. This

takes L + (p-1)(m/p)g time. Adding the two supersteps together gives approximately 2L+2mg

[Goudreau et al, 1999].

Timing for the Commonly Used (CU) model

For the commonly used model (CU), Allreduce takes 2(s+g)(log2 p). This can

be seen as follows. An Allreduce can be implemented as a Reduce and then a Bcast.

A Reduce is the command that gathers information from each processor to one “root”

processor. This is the opposite of a Bcast. To broadcast, a processor sends one item to

two other processors who in turn recursively each send to another two processors.

This is a cost of (s+g)(log2 p). It has been shown that an Allreduce performed using a

Reduce followed by a Bcast is optimal to within a factor of two. We assume this

implementation of Allreduce in our model, which is presented in Tables 5.1 - 5.4.

Optimally, an Allreduce can be done as quickly as a Reduce operation [Karp et al, 1993].

Bcast of m items depends on the algorithm used. In one algorithm, processor

A sends m items to processor B. Both processors recursively send to two other

processors. This takes (s+mg)(log2 p) time. The second algorithm splits the m items

into p/2 pieces of size 2m/p. A sends piece one to B, B sends it to C and so on. As

soon as C gets that piece (B is then ready for more) A sends the next piece to B. The last piece leaves A after p-1 steps and arrives at the last processor after 2p-3 steps.

Pipeline: A → B → C → D → ...  (each arrow carries a piece of size 2m/p)

This takes (s + (2m/p)g)·(2(p-3)) = 2ps + 4mg - 6s - 12(m/p)g. This second algorithm can also be

extended from a simple pipeline to using a log2p tree, similar to the first algorithm.

More detailed explanation of the BSP and log2 p models can be found in Goudreau et

al [1999] and Culler et al [1993] respectively.

These MPI primitives are used in the steps described in Section 5.1. Allreduce

is used for step 2 “Column Proposal” and Bcast is used for step 3 “Pivot Column

Broadcast.” In our implementation we replaced the broadcast by an Ethernet multicast operation, which takes advantage of Ethernet’s multicast/broadcast facility. Multicast, as opposed to broadcast (which affects all processors on the LAN), sends the vector only to the processors running the program.
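The multicast operation itself rests on ordinary datagram-socket calls. The sketch below shows only the sending side; the group address and port are hypothetical placeholders, the socket is assumed created with socket(AF_INET, SOCK_DGRAM, 0), and all error handling is omitted. Receivers would join the group with a setsockopt IP_ADD_MEMBERSHIP call.

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define GROUP "239.0.0.1"   /* hypothetical multicast group address */
#define PORT  9000          /* hypothetical port */

/* Send one pivot column (m doubles) to every processor in the group. */
void multicast_column(int sock, const double *col, int m) {
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = inet_addr(GROUP);
    addr.sin_port = htons(PORT);
    /* One sendto reaches all group members; unlike a broadcast, hosts
       that have not joined the group are not disturbed. */
    sendto(sock, col, m * sizeof(double), 0,
           (struct sockaddr *)&addr, sizeof(addr));
}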

5.3 The Experimental Environment

5.3.1 Description of lab(s)

Parallel experiments were performed on a set of homogeneous Sun

workstations. There are 7 identical Sun Ultra 5 Workstations (270 Megahertz), each

with 128 MB RAM, all running Solaris 5.7. A single 100-megabit per second

Ethernet with a shared file system connects them all.

Workstations that were not identical were excluded. During testing it was important to make sure that there was no outside network traffic.

Experiments for the serial version, retroLP, were also performed on a PC. It is

a Dell 610MT Workstation with 384 MB RAM. It has a Pentium II processor running

at 400MHz with a 16KB L1 instruction cache; 16KB L1 data cache; 512KB

integrated L2 cache. The PC environment was used for the results given in Section

7.3.

5.4 Estimates of computation parameters ucol, urow and upiv

For the computation part, the times for division, multiplication and

comparison of a double were estimated. Measurements were made on a loop of



15,000 operations. The running time of this loop was then divided by 15,000. All

optimization was turned off in the compilation. In addition, a large array was

allocated, and each element of this array was operated on. This scheme prevents the compiler from optimizing the loop away. It also matches the way operations are performed on a doubly subscripted array (the tableau). A problem with using the estimates of division,

multiplication and comparison is that the row choice is not only divisions, and the

column choice is not only comparisons, and the pivot is not only multiplications. To

get a closer estimate, the functions for the three different parts of the simplex method

were called for a matrix of size m=1,000 by n=10,000. The three parts are column

choice, row choice and pivot. Note that in practice, many columns are not looked at in

detail within the column choice. It is only necessary to look at the sign of the cost

coefficient and the bound of the non-basic variable. In one empirical test we found

that only 35% of the columns were eligible candidates for the basis; the other

columns could be immediately eliminated. This is further discussed in Section 7.2.1

in reference to the steepest edge column choice rule where this observation makes a

substantial difference. The resulting times were then divided by m=1,000, n=10,000

and mn=10 million for row choice, column choice and pivot respectively. This gave a

"unit" time for that part of the algorithm. This unit is in a sense the amortized time of

all the multiplication and division operations, as well as any other part of the

calculation. These units are listed in the last three columns of Table 5.1. The running

time for any other size problem can be estimated by multiplying the unit row choice

time by m, the unit column choice time by n and the unit pivot time by mn. Although

the timing estimates for division, multiplication and comparison given in Table 5.1

are not used in the calculation, they are provided for reference.
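The measurement idea can be sketched as follows. This is a simplified illustration, not the exact harness we used; a real run repeats the loop and accounts for loop overhead.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define COUNT 15000

int main(void) {
    /* Operate on every element of a large array so the compiler
       cannot collapse the loop; this also mimics tableau access. */
    double *a = malloc(COUNT * sizeof(double));
    for (int i = 0; i < COUNT; i++) a[i] = 1.0 + i;

    clock_t t0 = clock();
    for (int i = 0; i < COUNT; i++) a[i] *= 1.000001;   /* multiplication probe */
    clock_t t1 = clock();

    printf("unit multiplication ~ %g seconds\n",
           ((double)(t1 - t0) / CLOCKS_PER_SEC) / COUNT);
    free(a);
    return 0;
}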

Regression is another way of estimating these coefficients. The first step is to

run the actual program on many different problem sizes using differing numbers of

processors. We then tabulate a list of actual computation times taken by the program

runs. Linear regression is then used to estimate the coefficients of our timing

expressions. This is discussed further in Section 7 where we discuss how well our

expressions predict the actual computation times.

5.5 Estimates of communication parameters s, g and L

The network startup time s (latency) can be estimated by sending a 0-length packet many times; in our experiments, about 10,000 times. We sent the packet from A to B and then from B back to A [Dongarra and Dunigan, 1996]. This is called ping-pong. We then took the total elapsed time and divided it by 20,000 to get the per-communication (send/receive) latency. For g, which measures bandwidth, we sent a double vector of length 8,000 (64 Kbytes). We then took the total elapsed time and divided it by 8,000. We ignored the startup time since it should be small relative to the transmission of a large packet. One can estimate L by L = 2s(log2 p).
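A minimal MPI version of the ping-pong measurement might look like the following sketch (our actual measurements used the scheme described above over the lab network):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, reps = 10000;
    char payload = 0;                        /* near-zero-length message */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {                     /* A: send, then await the echo */
            MPI_Send(&payload, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&payload, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {              /* B: echo everything back */
            MPI_Recv(&payload, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&payload, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)                           /* 2*reps sends/receives in total */
        printf("s ~ %g seconds\n", (MPI_Wtime() - t0) / (2.0 * reps));

    MPI_Finalize();
    return 0;
}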

We estimated these parameters experimentally for the network of Sun workstations described in Section 5.3. s and g were calculated for use in the communication part of the running time.

As with the computation parameters, s and g can also be estimated through regression. In this case regression yielded better results than direct experimentation. This is explained more fully in Section 7.



Explanation of Table 5.1 and Table 5.2

Each row of Table 5.1 corresponds to one set of estimates; the estimation was repeated a number of times, and the average is in the bottom row. Referring to the bottom row of Table 5.1: s = 2.10*10^-3 seconds, g = 1.76*10^-6 seconds, division = 1.32*10^-7 seconds, multiplication = 7.87*10^-8 seconds and comparison = 3.81*10^-9 seconds. The unit row choice is 1.65*10^-6 seconds, the unit pivot is 1.24*10^-7 seconds and the unit column choice is 3.73*10^-8 seconds. L did not have to be estimated separately since it is a function of s and p.

As explained in Section 7, there are only two significant coefficients. One is

upiv, which is the coefficient for the pivot step. The second is ucol_se, which is the

coefficient for the column choice rule when the steepest edge column choice rule is

being used. The final coefficient values upiv and ucol_se, used in the formulas for the

6 different communication models, are listed in Table 5.2. These 6 models are listed

in Table 5.3 and are explained in the next section.



s          g          division   multiplication   comparison   urow       upiv       ucol       ucol_se
1.76E-03   1.23E-06   1.27E-07   1.09E-07         3.86E-09     1.66E-06   1.24E-07   3.73E-08   3.41E-07
1.74E-03   1.82E-06   1.35E-07   7.41E-08         3.80E-09     1.65E-06   1.25E-07   3.74E-08   3.43E-07
2.46E-03   1.88E-06   1.31E-07   7.38E-08         3.80E-09     1.63E-06   1.26E-07   3.72E-08   3.43E-07
2.35E-03   1.81E-06   1.38E-07   7.39E-08         3.80E-09     1.66E-06   1.24E-07   3.72E-08   3.42E-07
2.05E-03   1.85E-06   1.31E-07   7.77E-08         3.80E-09     1.71E-06   1.24E-07   3.71E-08   3.41E-07
2.64E-03   1.84E-06   1.32E-07   7.37E-08         3.80E-09     1.64E-06   1.24E-07   3.77E-08   3.43E-07
1.94E-03   1.82E-06   1.32E-07   7.37E-08         3.80E-09     1.64E-06   1.26E-07   3.72E-08   3.43E-07
1.88E-03   1.84E-06   1.33E-07   7.37E-08         3.80E-09     1.63E-06   1.24E-07   3.71E-08   3.42E-07
2.10E-03   1.76E-06   1.32E-07   7.87E-08         3.81E-09     1.65E-06   1.24E-07   3.73E-08   3.42E-07   (average)

Table 5.1 - Coefficient estimates (all times in seconds). urow = unit row choice, upiv = unit pivot,
ucol = unit column choice, ucol_se = unit column choice with steepest edge; the bottom row is the
average of the rows above.

upiv 1.24E-07
ucol_se 3.42E-07

Table 5.2 – Coefficient estimates used in models



activity                  eth-broad           eth-broad St. edge  eth-nobroad         CU Alg 1            CU Alg 2            BSP Alg 1           BSP Alg 2
comp. get local max       (n/p)ucol           (mn/p)(se_ucol)     (n/p)ucol           (n/p)ucol           (n/p)ucol           (n/p)ucol           (n/p)ucol
comm. Allreduce           (p-1)(s+2g)         p(s+2g)             2(p-1)(s+g)         (s+g)(log2 p)       (s+g)(log2 p)       L+2(p-1)g           L+2(p-1)g
comp. winner calcs.
      pivot row           (m)urow             (m)urow             (m)urow             (m)urow             (m)urow             (m)urow             (m)urow
comm. Bcast column + int  s+(m+1)g            s+(m+1)g            (p-1)(s+(m+1)g)     (s+mg)(log2 p)      2ps+4mg             L+(p-1)mg           ~2L+2mg
comp. do pivot            ((n+1)(m+1)/p)upiv  ((n+1)(m+1)/p)upiv  ((n+1)(m+1)/p)upiv  ((n+1)(m+1)/p)upiv  ((n+1)(m+1)/p)upiv  ((n+1)(m+1)/p)upiv  ((n+1)(m+1)/p)upiv

(The CU and BSP columns assume a fully connected network.)

Table 5.3 - Expressions of the six models

                     n = 100               n = 1,000             n = 10,000            n = 50,000             n = 100,000
                     p*     time           p*     time           p*     time           p*      time           p*      time
eth-broad            8.1    0.005832156    25.3   0.012539132    80.1   0.033778564    179.1   0.072178696    253.3   0.1009531
eth-broad St. edge   15.5   0.008733838    49.1   0.021736331    155.1  0.062869419    346.9   0.137229171    490.6   0.192948601
eth-nobroad          2.5    0.009358383    7.7    0.031334104    24.5   0.100926097    54.7    0.226745932    77.3    0.321026613
CU alg 1             4.6    0.005184726    45.8   0.007063647    457.3  0.008949164    2286.5  0.010267533    4572.9  0.010835344
CU alg 2             26.4   0.022746269    31.3   0.029126954    62.2   0.063022539    129.4   0.129994516    181.4   0.180583468
BSP alg 1            13.3   0.023429024    15.8   0.034337139    31.3   0.090299238    65.3    0.19768193     91.6    0.278111702
BSP alg 2            64.7   0.011540454    203.8  0.014284651    644.1  0.018843167    1440.2  0.024959694    2036.7  0.029115974

Table 5.4 – p* and optimal time per iteration (seconds) for m = 1,000



5.6 The performance model

The program as explained in Section 3 is an iterative method. It therefore

depends on the number of iterations. Our method parallelizes each iteration.

Assuming we use the same column choice rule whether we use the full tableau

method or the revised method, the number of iterations should be the same. Our

analysis therefore focuses on the timing within an iteration. To get the total running

time the number of iterations can then be multiplied by the value calculated. We are

careful to only include the time spent solving the problem; e.g., the time taken to read

input or report output is not included.

Explanation of Table 5.3

Section 5.2 gave a short sketch of one distributed pivot step. Step 1 in Section

5.2 consists of computation within each processor of its local maximum followed by

communication of the maximums to the coordinating processor. The timing for these

actions is given in the first two rows of Table 5.3. Step 2 in Section 5.2 consists of

computation within the “winning” processor of the leaving basic variable (row

choice) and the communication via broadcast of both its pivot column together with

its row choice. The timing for these actions is given in rows 3 and 4 of Table 5.3.

Finally the timing of the pivot within Step 3 of Section 5.2 is given in the last row of

Table 5.3. Note that this analysis assumes that pivot steps will be performed every

iteration.

urow, ucol, upiv, s and g are constants as in the previous section. se_ucol is

the constant for the steepest edge column choice rule. This constant is close to upiv in

value. In rows 1, 3 and 5 each processor goes through its n/p columns, m rows and ((n+1)/p)(m+1) matrix elements respectively. Computation calculations similar to those in rows 1, 3 and 5 of Table 5.3 can be found in Nash and Sofer [1996, pgs. 114-116].

The communication rows (2 and 4) of Table 5.3 were explained in Section 5.2.3.

Each column of the table corresponds to a model. To get the complete running

time of the program for a given model, simply sum that column and multiply by the

number of iterations. This assumes the classical column choice rule was used with a

pivot in each iteration within Step 3 of the steps listed in Section 5.2. The second

column is the only column that assumes the steepest edge rule. The only change from

the first column is in the first row.

5.7 The optimum number of processors

For our distributed algorithm, there is a tradeoff between communication time

and computation time. As the number of processors increases, the computation time

decreases, but the communication time increases. For each communication model, we

can estimate the optimal number of processors to use by taking the derivative of the

timing function of the algorithm with respect to p. This is then set to 0 and solved for

p. The timing functions correspond to the columns in Table 5.3.

As an example consider the Ethernet model without broadcast. Using the

given formulas with our estimated values of s, g, ucol, urow, upiv with m = 100 and n

= 50,000 results in an optimal number of processors p = 16.3. In practice this number

would have to be rounded.



    f(p) = (n/p)ucol + 2(p-1)(s+g) + m·urow + (p-1)(s+(m+1)g) + ((n+1)(m+1)/p)upiv

    df/dp = -(n·ucol + (n+1)(m+1)·upiv)/p² + 3s + 3g + mg = 0

Table 5.4 gives the predicted optimal number of processors for m = 1,000

where n varies from 100 to 100,000.

The following expressions generalize this example. The expressions drop the

units (m instead of m+1). γ, ρ, π and Γ are ucol, urow, upiv and se_ucol

respectively. p* is the optimum p. Three expressions are given. The first corresponds

to the Dantzig rule assuming Ethernet with broadcast as in our example above. The

second corresponds to the Steepest Edge rule again assuming Ethernet with broadcast.

The third expression corresponds to the Dantzig rule but the communication model is

Ethernet without the broadcast facility.

Dantzig Rule:
    T = γn/p + ρm + πmn/p + s + gm + (s+g)p
    p* = sqrt((γn + πmn)/(s+g))

Steepest Edge:
    T = Γmn/p + ρm + πmn/p + s + gm + (s+g)p
    p* = sqrt(mn(Γ+π)/(s+g))

Linear Broadcast:
    T = γn/p + ρm + πmn/p + (s+gm)p + (s+g)p
    p* = sqrt((γn + πmn)/(2s + g(m+1)))
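These closed forms translate directly into code. The sketch below simply transcribes the three expressions; gam, pi_ and Gam stand for γ, π and Γ, and the results depend entirely on the coefficient estimates supplied.

#include <math.h>

double pstar_dantzig(double gam, double pi_, double s, double g, double m, double n) {
    return sqrt((gam * n + pi_ * m * n) / (s + g));
}

double pstar_steepest_edge(double Gam, double pi_, double s, double g, double m, double n) {
    return sqrt(m * n * (Gam + pi_) / (s + g));
}

double pstar_linear_broadcast(double gam, double pi_, double s, double g, double m, double n) {
    return sqrt((gam * n + pi_ * m * n) / (2.0 * s + g * (m + 1.0)));
}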

5.8 The optimum number of processors with a new column division scheme

Classic rule

The optimum p* derived in the previous section assumed that columns were

equally divided among the processors. After the processors each calculated their best

choice of entering column the processors did a communication step. This

communication took a total of ps time; one proposal for each processor. If we can

give an unequal number of columns to each processor, some of the communication

would be simultaneous with computation.

We would like to divide the n columns so that the processors receive

    n0, n0 + k, n0 + 2k, ..., n0 + (p-1)k columns  ⇒

    n = n0·p + k·p(p-1)/2  ⇒

    n0 = n/p - k(p-1)/2

Where n0 is the number of columns every processor has as a base and k is the

additional columns a processor gets over the previous processor. k should be the

number of columns whose computation takes the time of one send.

    s + g = (πm + γ)k  ⇒  k = (s+g)/(πm + γ)

Notice that now, instead of a cost of ps, there is only the cost of the s of the last processor. The new timing function and p* are

    T = γ(n0 + (p-1)k) + ρm + πm(n0 + (p-1)k) + s + gm + s + g

    p* = sqrt(2n/k) = sqrt(2)·sqrt((γn + πmn)/(s+g))

This scheme’s p* is larger than the old scheme’s p* by a factor of the square root of 2.

As an example take a problem where m=1,000 and n=5,000. The first scheme

calculates

    p* = 53.53
    n0 = n/p = 93.40
    k = 0
    T = .02347 seconds per iteration

The new scheme calculates

p* = 75.95
n0 = .866
k = 1.73
T = .01761 seconds per iteration

This is a significant savings in time.

All of the calculations and experiments in this paper used the first scheme.

The last scheme is mentioned in the future work section (Section 8).
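For reference, the second scheme's quantities can be computed as in the following sketch, a direct transcription of the formulas above (gam = ucol and pi_ = upiv; the numbers produced depend entirely on the coefficient estimates supplied):

#include <math.h>

typedef struct { double k, n0, pstar; } Scheme2;

Scheme2 scheme2(double gam, double pi_, double s, double g, double m, double n) {
    Scheme2 r;
    r.k     = (s + g) / (pi_ * m + gam);            /* columns worth one send */
    r.pstar = sqrt(2.0 * n / r.k);                  /* sqrt(2) times the old p* */
    r.n0    = n / r.pstar - r.k * (r.pstar - 1.0) / 2.0;
    return r;
}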

Steepest edge rule

A similar analysis is provided for the steepest edge column choice rule.

    s + g = (πm + Γm)k  ⇒  k = (s+g)/(πm + Γm)

    T = Γm(n0 + (p-1)k) + ρm + πm(n0 + (p-1)k) + s + gm + s + g

    p* = sqrt(2n/k) = sqrt(2)·sqrt((Γmn + πmn)/(s+g))

This scheme’s p* is also larger than the old scheme’s p* by a factor of the square root of 2.

Figure 5.1 shows the time per iteration as n increases. Both schemes are

compared using both column choice rules.

[Figure: time per iteration (vertical axis) versus n (horizontal axis); series: Par. Bcast,
Bcast scheme 2, Par. Bcast S.E., Bcast SE sch 2]

Figure 5.1 - Time per iteration as n increases

5.9 Estimated parallel running time for each communication model

Table 5.5 consists of the estimated timing for all the communication methods.

We use the results of Table 5.4 as the number of processors (p) to plug into the

expressions in Table 5.3. In Table 5.5, m is set at 1,000. It is now easy to estimate the

ratio of computation to communication. For example in the Ethernet-broadcast model

with n=50,000, computation time is a total of .0359 seconds per iteration.

Communication takes .0362 seconds per iteration.



n 100 1000 10,000 50,000 100,000


eth-broad col choice comp. 3.11E-10 9.86E-10 3.12E-09 6.98E-09 9.87E-09
comm. 1.57E-03 4.72E-03 1.53E-02 3.45E-02 4.89E-02
row choice comp. 1.21E-03 1.21E-03 1.21E-03 1.21E-03 1.21E-03
comm. 1.69E-03 1.69E-03 1.69E-03 1.69E-03 1.69E-03
pivot comp. 1.56E-03 4.91E-03 1.55E-02 3.47E-02 4.91E-02
comp: 2.77E-03 6.12E-03 1.67E-02 3.59E-02 5.03E-02
comm: 3.27E-03 6.41E-03 1.70E-02 3.62E-02 5.06E-02
total: 6.04E-03 1.25E-02 3.38E-02 7.22E-02 1.01E-01
eth-broad col choice comp. 2.20E-03 6.97E-03 2.21E-02 4.93E-02 6.98E-02
St. Edge comm. 3.04E-03 9.59E-03 3.03E-02 6.78E-02 9.59E-02
row choice comp. 1.21E-03 1.21E-03 1.21E-03 1.21E-03 1.21E-03
comm. 1.69E-03 1.69E-03 1.69E-03 1.69E-03 1.69E-03
pivot comp. 8.09E-04 2.54E-03 8.02E-03 1.79E-02 2.54E-02
comp: 4.22E-03 1.07E-02 3.13E-02 6.85E-02 9.63E-02
comm: 4.73E-03 1.13E-02 3.20E-02 6.95E-02 9.75E-02
total: 8.95E-03 2.20E-02 6.33E-02 1.38E-01 1.94E-01
eth-nobroad col choice comp. 1.02E-09 3.23E-09 1.02E-08 2.29E-08 3.23E-08
comm. 5.65E-04 2.61E-03 9.09E-03 2.08E-02 2.96E-02
row choice comp. 1.21E-03 1.21E-03 1.21E-03 1.21E-03 1.21E-03
comm. 2.47E-03 1.14E-02 3.97E-02 9.09E-02 1.29E-01
pivot comp. 5.12E-03 1.61E-02 5.09E-02 1.14E-01 1.61E-01
comp: 6.33E-03 1.73E-02 5.21E-02 1.15E-01 1.62E-01
comm: 3.03E-03 1.40E-02 4.88E-02 1.12E-01 1.59E-01
total: 9.36E-03 3.13E-02 1.01E-01 2.27E-01 3.21E-01
CU (logP) col choice comp. 5.41E-10 5.46E-10 5.47E-10 5.47E-10 5.47E-10
alg 1 comm. 4.28E-04 1.07E-03 1.71E-03 2.16E-03 2.36E-03
row choice comp. 1.21E-03 1.21E-03 1.21E-03 1.21E-03 1.21E-03
comm. 3.74E-03 9.34E-03 1.50E-02 1.89E-02 2.06E-02
pivot comp. 2.72E-03 2.72E-03 2.72E-03 2.72E-03 2.72E-03
comp: 3.93E-03 3.93E-03 3.93E-03 3.93E-03 3.93E-03
comm: 4.16E-03 1.04E-02 1.67E-02 2.10E-02 2.29E-02
total: 8.10E-03 1.43E-02 2.06E-02 2.50E-02 2.69E-02
CU (logP) col choice comp. 9.48E-11 7.98E-10 4.02E-09 9.66E-09 1.38E-08
alg 2 comm. 9.15E-04 9.64E-04 1.16E-03 1.36E-03 1.45E-03
row choice comp. 1.21E-03 1.21E-03 1.21E-03 1.21E-03 1.21E-03
comm. 1.61E-02 1.81E-02 2.99E-02 5.58E-02 7.58E-02
pivot comp. 4.77E-04 3.97E-03 2.00E-02 4.81E-02 6.86E-02
comp: 1.69E-03 5.18E-03 2.12E-02 4.93E-02 6.98E-02
comm: 1.71E-02 1.90E-02 3.11E-02 5.72E-02 7.72E-02
total: 1.87E-02 2.42E-02 5.23E-02 1.06E-01 1.47E-01
BSP alg1 col choice comp. 1.89E-10 1.59E-09 7.98E-09 1.91E-08 2.73E-08
comm. 1.47E-03 1.58E-03 2.00E-03 2.51E-03 2.78E-03
row choice comp. 1.21E-03 1.21E-03 1.21E-03 1.21E-03 1.21E-03
comm. 1.98E-02 2.37E-02 4.74E-02 9.88E-02 1.38E-01
pivot comp. 9.48E-04 7.90E-03 3.97E-02 9.52E-02 1.36E-01
comp: 2.16E-03 9.11E-03 4.09E-02 9.64E-02 1.37E-01
comm: 2.13E-02 2.52E-02 4.94E-02 1.01E-01 1.41E-01
total: 2.34E-02 3.44E-02 9.03E-02 1.98E-01 2.78E-01
BSP alg2 col choice comp. 3.86E-11 1.23E-10 3.88E-10 8.68E-10 1.23E-09
comm. 2.51E-03 3.56E-03 5.52E-03 8.35E-03 1.03E-02
row choice comp. 1.21E-03 1.21E-03 1.21E-03 1.21E-03 1.21E-03
comm. 7.63E-03 8.90E-03 1.02E-02 1.11E-02 1.15E-02
pivot comp. 1.94E-04 6.11E-04 1.93E-03 4.32E-03 6.11E-03
comp: 1.40E-03 1.82E-03 3.14E-03 5.53E-03 7.32E-03
comm: 1.01E-02 1.25E-02 1.57E-02 1.94E-02 2.18E-02
total: 1.15E-02 1.43E-02 1.88E-02 2.50E-02 2.91E-02
Table 5.5 – Time per iteration for m=1,000

5.10 Running time estimates of the revised (MINOS), serial (retroLP) and

parallel Full-tableau algorithms (dpLP)

Table 5.6 is based on the models of Sections 3.4 and 5.5-5.8 and uses the

coefficients given in Table 5.2. Table 5.6 compares the estimated running time per

iteration of three algorithms for problems of varying size. The three algorithms we

compared are our serial full-tableau simplex method, the revised method and our

parallel full-tableau simplex method. The parallel simplex in the table assumes an

Ethernet with broadcast and the optimum number of processors. Two sets of optimal

number of processors are shown corresponding to the simple scheme of equally

dividing up the columns amongst the processors and the scheme proposed in Section

5.8. Both the serial time and the parallel time are also shown when the steepest edge

column choice rule is used. Times per iteration for the revised method were

estimated, assuming both a dense (100%) tableau and a sparse (5%) tableau. Both

densities were not shown for the full tableau method because density has very little

effect on the running time of the full-tableau algorithms whereas the revised method

runs more quickly on sparse problems. The last two columns of Table 5.6 show the

optimum number of processors to use when employing the standard and steepest edge

column choice rules respectively.

The optimum number of processors p would in practice be an integer even

though mathematically it can be a fraction. The timing of the algorithm is not

sensitive to this approximation (see Section 5.14). Note that the revised method, for

completely dense problems, is slower than the full tableau for all n in the table (aside

from the last line). As n rises, the revised method catches up at a very slow rate. The

revised method catches up when n equals 2m²+m, as calculated in Section 3.3. This

is shown on the last line.


48

m n serial serial S.E. Par. Bcast Bcast scheme 2 Par. Bcast S.E. Bcast SE sch 2 revised dense revised sparse p*-standard p*-St.Edge
5,000 4,500 2.80 10.50 0.0601 0.0468 0.1038 0.0776 12.13 3.56 120.13 232.68
5,000 5,000 3.12 11.67 0.0627 0.0486 0.1087 0.0811 12.44 3.58 126.63 245.26
5,000 10,000 6.22 23.33 0.0830 0.0629 0.1481 0.1089 15.55 3.73 179.07 346.85
5,000 25,000 15.55 58.32 0.1233 0.0915 0.2262 0.1642 24.87 4.20 283.13 548.41
5,000 50,000 31.09 116.64 0.1688 0.1236 0.3143 0.2265 40.41 4.97 400.40 775.56
5,000 75,000 46.63 174.95 0.2037 0.1483 0.3819 0.2743 55.95 5.75 490.39 949.87
5,000 100,000 62.18 233.27 0.2331 0.1691 0.4389 0.3146 71.49 6.53 566.25 1,096.81

10,000 9,000 11.20 42.00 0.1203 0.0933 0.2076 0.1550 48.50 14.24 240.24 465.34
10,000 10,000 12.45 46.66 0.1253 0.0968 0.2173 0.1619 49.74 14.30 253.23 490.51
10,000 20,000 24.88 93.31 0.1660 0.1256 0.2961 0.2176 62.17 14.92 358.12 693.68
10,000 50,000 62.18 233.27 0.2467 0.1826 0.4524 0.3281 99.47 16.79 566.22 1,096.80
10,000 100,000 124.34 466.52 0.3376 0.2470 0.6286 0.4527 161.63 19.89 800.76 1,551.10
10,000 150,000 186.51 699.77 0.4074 0.2963 0.7638 0.5483 223.79 23.00 980.72 1,899.71
10,000 200,000 248.67 933.02 0.4663 0.3379 0.8778 0.6289 285.94 26.11 1,132.44 2,193.59
10,000 500,000 621.66 2,332.54 0.7215 0.5184 1.3721 0.9785 658.89 44.76 1,790.54 3,468.37
10,000 1,000,000 1,243.31 4,665.07 1.0091 0.7217 1.9293 1.3724 1,280.48 75.84 2,532.21 4,905.02

10,000 300,010,000 373,000.75 1,399,562.96 17.0359 12.0538 32.9740 23.3240 373,000.75 18,661.86 43,859.85 84,958.82
Table 5.6- estimated running time per iteration

[Figure: time per iteration in seconds (vertical axis) versus n (horizontal axis, 4,500 to
100,000); series: serial, Par. Bcast, revised dense, revised sparse]

Figure 5.2 – Time per iteration (m=5,000)



In Figure 5.2, the 3 algorithms are compared. The figure corresponds to the

top part of Table 5.6 where m is 5,000. The x-axis is n and the y-axis is the estimated

time per iteration in seconds. This comparison is for dense problems. The revised method is even slower than the standard simplex. This is due to the extra computation needed to

calculate the objective row and the pivot column. As just mentioned, the figure shows

the revised method slowly catching up to the full tableau method as n increases. It

takes such a large column size to catch up that it seems reasonable to say that for all

practical dense problems the revised method is slower than the standard method.

Moving over to the Ethernet based parallel algorithm one can see that the added time

for a higher n is minimal. It is only the cost of sending a larger vector over the

Ethernet, which is a very small cost for an Ethernet with a broadcast facility.

There are four variables to deal with when comparing the different methods.

They are Aspect Ratio (AR=n/m), size (m), density (d), and number of processors (p).

Figures 5.3, 5.4 and 5.5, compare the serial full tableau method using the

standard column choice rule, the serial full tableau method using the steepest edge

column choice rule and the revised method. They compare them as density, aspect

ratio and m are varied, respectively. The base problem is m=100, p=1 (serial method),

density=5% and AR=10 (which implies that n=1,000).

Figure 5.6 is the fourth graph of the group. It shows a comparison between the

parallel full tableau method using the standard column choice rule, the parallel full

tableau method using the steepest edge column choice rule and the revised method.

The parallel method uses an Ethernet with broadcast. We also include the Ethernet

without use of broadcast.



Figure 5.7 is the same as Figure 5.6 except that it is for a larger problem

where m=10,000 and n=100,000. We graphed this in order to show a problem for

which the parallel method using a small number of processors would actually

overtake the revised method.

In both Figure 5.6 and Figure 5.7 the point on each curve that is the lowest

time per iteration is where the optimum p (p*) is being used. The p* values for the

parallel methods graphed in Figure 5.7 are:

eth-broad 75.7

eth-broad St. edge 143.5

eth-nobroad 20.1

Notice that these numbers can be rounded to whole integers. From the graphs

one can see that the time per iteration for Ethernet using broadcast is not very

sensitive to p in the range of p*/2 to p* (see Section 5.7).

[Figure: time per iteration versus density (non-zero coefficients), 0% to 100%;
series: revised, serial, Serial S.E.]

Figure 5.3 - Time per iteration as a function of density



[Figure: time per iteration versus aspect ratio, 0.5 to 40; series: revised, serial, Serial S.E.]

Figure 5.4 - Time per iteration as a function of aspect ratio

[Figure: time per iteration versus number of rows m, 50 to 1,000; series: revised, serial,
Serial S.E.]

Figure 5.5 - Time per iteration as a function of m



[Figure: time per iteration versus number of processors p, 1 to 45; series: revised,
Eth-Bcast, Eth-NoBcast, Eth St.edge]

Figure 5.6 – Time per iteration as a function of p

[Figure: time per iteration versus number of processors p, 1 to about 155; series: revised,
Eth-Bcast, Eth-NoBcast, Eth St.edge]

Figure 5.7 – Time per iteration as a function of p

Analysis of figures 5.3 – 5.7

For sparse problems the revised is much quicker than the serial full tableau

method. Our distributed method when used with the optimal number of processors on

an Ethernet with broadcast is even faster than the revised method. We thus have two

important parameters that determine whether the revised method or our full tableau

(retroLP and dpLP) algorithms are faster. One parameter is density and the other is

the number of processors that we have. If we have a completely dense problem our

method is faster. If we have the optimum number of processors our method is often

faster even on sparse problems. This can be seen in Figures 5.6 and 5.7. The figures

refer to problems with only 5% density. The parallel algorithm on an Ethernet with

broadcast comes extremely close to the time of the revised, for a problem size of 100

x 1,000 (figure 5.6) for its optimum 7.6 processors. For a problem size of 1,000 X

10,000 the parallel method would overtake the revised if it had 7 processors, much

lower than its optimum. This is shown in Figure 5.7. The optimum p for that example

is about 76 and 144 processors for the standard and steepest edge rules respectively.

Even in our first example a slightly higher density would already cause the revised to

be slower. In practice our method can therefore be used in two cases. One case:

problems that have a significant number of nonzero coefficients. Examples of such

applications are given in Section 7. The second case occurs whenever we have access

to many processors – even for problems that are not dense.

5.11 Advantage of the Steepest Edge column choice rule

These comparisons assume all methods are using the classical Dantzig column

choice rule. The tableau method and especially the distributed method can make very

effective use of alternative column choice rules to further improve the relative

performance. One of these rules is the steepest edge rule. As mentioned in Section

4.1.2 it is not necessary to re-compute any columns in order to perform this rule. It

simply requires, at most, an extra mn multiplications within the column choice step.

The steepest edge rule is included in these tables and in Figures 5.6 and 5.7. Notice

how expensive it is to use with only one processor and how quickly it speeds up as

the processor number increases. Here we only compare time per iteration. The gain of

the alternate column choice rules is in the fewer number of iterations that are

necessary. This extra computation is more than offset by the reduction in the number

of iterations [Harris 1973; Goldfarb and Reid 1977; Forrest and Goldfarb, 1992]. See

Section 7.4 for experimental results supporting this view. This extra computation is

shared amongst the processors just like the rest of the computation. These rules can

be faster than the classical rule even without the use of multiple processors. This

effect is compounded when using multiple processors.

Note that the expressions and all the graphs assume that every column is

looked at using the steepest edge column choice rule. This in fact is not true. In one

empirical test we found, on average, that only 35% of the columns were eligible and

therefore looked at. All the charts and graphs assume all columns were eligible. This

is an upper bound. The steepest edge does even better in practice than that which is

depicted in the graphs. This is also discussed in Section 7.

An expression for time per iteration using the steepest edge rule was provided

in Section 5.8. As the number of processors increases, the iteration time of the

steepest edge rule becomes more competitive with the iteration time of the standard

rule. It cannot actually catch up to the speed of the standard rule per iteration. We

would like to calculate the fraction of iterations at which the steepest edge overtakes the standard column choice rule. This break-even percentage is simply the time per iteration of the standard rule divided by the time per iteration of the steepest edge rule.

For a problem of size 1,000 by 10,000 (Figure 5.7) the percentage when p=1 is

27.89%. This means that the steepest edge must iterate 27.89% of the iterations

performed by the standard method in order for it to catch up in speed. The higher the

percentage, the better it is for the steepest edge method. For 76 processors (the

optimum for the standard method) the percentage is 46.04%. For 144 processors (the

optimum for the steepest edge) the percentage is 65.92%. Section 7.3 shows how

many iterations were actually performed by the steepest edge method in practice.
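The break-even percentage is just the ratio of the two per-iteration times from Section 5.7; a sketch (gam, Gam, rho and pi_ stand for γ, Γ, ρ and π):

/* Percentage of the standard rule's iterations at which steepest edge
   breaks even: T_standard / T_steepest_edge per iteration, times 100. */
double breakeven_percent(double gam, double Gam, double rho, double pi_,
                         double s, double g, double m, double n, double p) {
    double t_std = gam * n / p + rho * m + pi_ * m * n / p + s + g * m + (s + g) * p;
    double t_se  = Gam * m * n / p + rho * m + pi_ * m * n / p + s + g * m + (s + g) * p;
    return 100.0 * t_std / t_se;
}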

5.12 Memory requirements of revised and full tableau method

The full tableau holds a data matrix of m+1 by n+1 double precision floating-

point numbers. It also has a few auxiliary vectors, which are not included in this

calculation. (6 vectors of size m and 6 vectors of size n.) A double precision element

is 8 bytes. That amounts to 8(m+1)(n+1) bytes. A problem with table size 100x100

(m=100, n=100) takes 81,608 bytes and a problem size of 1,000 X 10,000 takes

80,088,008 bytes. Our parallel implementation would cut memory requirements on each processor to approximately memory/p, where p is the number of processors and

memory is the memory required when using only one processor. The revised method

requires (m+1)(n+1) elements for the original data. In addition it requires a (m+1) by

(m+1) matrix for the inverse of the basis assuming an explicit basis inverse is

maintained. It also has an extra vector of size m+1 for the pivot column, which is not

included in the calculation. This equals (m+1)(n+1)+(m+1)(m+1). A double precision



element takes up 8 bytes. That amounts to 8[(m+1)(n+1) + (m+1)(m+1)] bytes. A problem with table size 100x100 takes 163,216 bytes and a problem size of 1,000 x 10,000 takes 88,104,016 bytes. Figure

5.8 is a graph of memory requirements in megabytes for both the tableau method and

the revised method as m gets larger and assuming n=10m. Notice that for dense

problems with this aspect ratio, the revised method takes approximately 10% more

space than the full tableau method.

    [(m+1)(n+1) + (m+1)(m+1)] / [(m+1)(n+1)] = 1 + (m+1)/(n+1)
                                             = 1 + (1 + 1/m)/(n/m + 1/m)
                                             = 1 + (1 + 1/m)/(10 + 1/m)    for n = 10m

This assumed a direct inverse representation of the basis for the revised

simplex method. Functional equivalents would take a similar amount of storage in the

dense case. For sparse problems the revised method saves a lot of space by using

sparse matrix techniques.
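The two storage formulas are trivially coded; a sketch (bytes, with 8-byte doubles, and the auxiliary vectors ignored as in the text):

double mem_full_tableau(double m, double n) {
    return 8.0 * (m + 1.0) * (n + 1.0);                 /* the (m+1) x (n+1) tableau */
}

double mem_revised(double m, double n) {
    /* original data plus an explicit (m+1) x (m+1) basis inverse */
    return 8.0 * ((m + 1.0) * (n + 1.0) + (m + 1.0) * (m + 1.0));
}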



[Figure: memory in megabytes versus m (with n = 10m); series: full tableau, revised]

Figure 5.8 – Memory (in Megabytes)

5.13 Asymptotic (computation/communication ratio change) analysis

The coefficients used here were based on the network described in Section

5.3. It is interesting to see how it would fare on more current networks and on

networks in the future where the computation to communication time ratios may

change.

We will assume that the two communication parameters s and g change together. On our network s = 3*10^-4 and g = 1.7*10^-6, where g corresponds to sending one double (for a byte it would be .425*10^-6). The computation operations on a double floating point number are of order 10^-7. These parameters include the unit column choice for steepest edge, the unit pivot, multiplication and division. For simplicity all computational operations in this analysis will be assumed to take 10^-7 even though there are slight variations in practice.



Figure 5.9 and Figure 5.10 show the asymptotic change in the speedup of our

parallel program, when exactly 7 processors are used, as the ratio of the computation

time to communication time changes. The leftmost points in both figures show the

current speedup assuming the current speeds of s, g and computation. Each point

along the horizontal axis assumes that the communication or computation speed, for

Figures 5.9 and 5.10 respectively, doubles from the speed of the point to its left.

Figure 5.9 shows what happens when both s and g get faster and the computation

speed is held constant. On each subsequent point the speed of both s and g are

doubled. Figure 5.10, on the other hand, shows what happens when the computation

speed gets faster and both s and g are held constant. On each subsequent point the

computation speed is doubled. We assume a problem size of 1,000 by 10,000 running

on an Ethernet with broadcast and where the optimum p, p*, is used in dpLP.

From the figures we can see that the speedup is affected very much by

changes in computation and communication parameters. As s and g get faster we can

take advantage of more processors since the communication costs are lower. On the

other hand, as computation speeds increase, the relative communication cost

increases. This causes a decrease in speedup since we can use fewer processors.

[Figure: speedup (0 to 350) versus computation time/communication time ratio, scaled by
successive doublings of communication speed (1, 2, 4, ..., 128)]

Figure 5.9 – Asymptotic speedup when a unit computation = 10^-7; s and g move together

[Figure: speedup (0 to 30) versus computation time/communication time ratio, scaled by
successive doublings of computation speed (1.000, 0.500, ..., 0.008)]

Figure 5.10 - Asymptotic speedup when s = 3.4*10^-7 and g = 1.7*10^-6; computation speed is changing

5.14 Sensitivity to s, π and p

In estimating parameters it is important to know how sensitive the timing is to

small changes in the parameters. This is important because, before the program is run,

our timing expressions tell us how many processors, p*, to use. We then know the

maximum number of processors that we need for the problem.

p*, if only changed slightly, does not significantly affect the overall running

time. This suggests that we can round p* to the nearest whole number without

significantly affecting the predicted running time. More importantly, it might be

difficult to use the optimum p* when it is a large number. We would like to know

how much the timing would be affected if we use a few fewer processors than the

optimum.

Sensitivity of timing to changes in p at p*:

Assume that we round the optimal number of processors, p*, to its nearest

integer pint*, and we then run the problem with pint* processors rather than with p*. The amount of time the program runs with pint* is Tint; using p* processors would have given a running time of T.

    T = T(p*)
    Tint = T(pint*)

    T = γn/p + ρm + πmn/p + s + gm + (s+g)p

    p* = sqrt((γn + πmn)/(s+g))  ⇒  γn + πmn = p*²(s+g)            (1.1)

    T(p*) = (γn + πmn)/p* + (ρ+g)m + (s+g)p* + s

          = (γn + πmn)/sqrt((γn + πmn)/(s+g)) + (ρ+g)m + (s+g)·sqrt((γn + πmn)/(s+g)) + s

          = sqrt(γn + πmn)·sqrt(s+g) + (ρ+g)m + sqrt(s+g)·sqrt(γn + πmn) + s

          = 2·(γn + πmn)^(1/2)·(s+g)^(1/2) + (ρ+g)m + s

          = 2p*(s+g) + (ρ+g)m + s                                  from (1.1)

          ≥ 2p*(s+g)

    Tint = (γn + πmn)/pint + (ρ+g)m + (s+g)pint + s

    Tint - T(p*) = (γn + πmn)/pint + (s+g)pint - 2p*(s+g)

                 = p*²(s+g)/pint + (s+g)pint - 2p*(s+g)            from (1.1)

                 = (s+g)(p*²/pint + pint - 2p*)

                 = (s+g)(p*² + pint² - 2p*·pint)/pint

                 = (s+g)(p* - pint)²/pint                          a perfect square

                 < (s+g)(.5)²/pint                                 rounding p* to the nearest integer

                 < (1/4)(s+g)/pint

    (Tint - T(p*))/T(p*) < [(1/4)(s+g)/pint] / [2p*(s+g)] = 1/(8·pint·p*), which bounds the relative error.
A change in p* does not substantially affect T.

This can also be seen approximately from the second derivative in the Taylor series:

    dT/dp = -(γn + πmn)/p² + s + g

    T'' = 2(γn + πmn)/p³

    T'' = 2(s+g)/p                 at p*, using (1.1)

    Tint - T(p*) = T'(p*)(pint - p*) + (1/2)T''(pφ)(pint - p*)²

    |Tint - T(p*)| ≤ (1/2)(1/4)T''(pφ)                             since |pint - p*| < .5

                  ≤ (1/4)(γn + πmn)/pφ³ = (1/4)p*²(s+g)/pφ³ ≡ O(10^-4/p*)

The second derivative is extremely small, on the order of 10^-5, whereas an iteration on a 1,000 by 5,000 problem with 10 processors takes about .1 seconds. p* is at least 1 and is usually over 10 even for relatively small problems. Rounding p* to the nearest integer does not affect T in any significant way.

Sensitivity of timing to changes in s:

Assume we think that the startup time is serr; we then calculate perr* based on serr and run the problem with perr* processors. In fact the startup time is s (not serr). The amount of time the program runs with the wrong perr* is Terr. We should have used p* processors, which would have given a running time of T.



    T = T(p*(s), s)
    Terr = T(perr*(serr), s)

    T = 2p*(s+g) + (ρ+g)m + s                                      from (1.1)

    Terr = (γn + πmn)/perr + (ρ+g)m + (s+g)perr + s

    Terr - T = (γn + πmn)/perr + (s+g)perr - 2p*(s+g)

    perr = sqrt((γn + πmn)/(serr + g))  ⇒

    Terr - T = (γn + πmn)/sqrt((γn + πmn)/(serr + g))
               + (s+g)·sqrt((γn + πmn)/(serr + g)) - 2·sqrt((γn + πmn)(s+g))

             = sqrt(γn + πmn)·(sqrt(serr + g) + (s+g)/sqrt(serr + g) - 2·sqrt(s+g))

             = sqrt(γn + πmn)·((serr + g) + (s+g) - 2·sqrt(s+g)·sqrt(serr + g))/sqrt(serr + g)

             = (sqrt(γn + πmn)/sqrt(serr + g))·(sqrt(s+g) - sqrt(serr + g))²

    Terr - T = perr*·(sqrt(s+g) - sqrt(serr + g))²

Sensitivity of timing to changes in π:

Assume we think that a double floating-point multiplication takes time πerr; we then calculate perr* based on πerr and run the problem with perr* processors. In fact the floating-point multiplication time is π (not πerr). The amount of time the program runs with the wrong perr* is Terr. We should have used p* processors, which would give a running time of T.

    T = T(p*(π), π)
    Terr = T(perr*(πerr), π)

    p* = sqrt((γn + πmn)/(s+g))          ⇒  γn + πmn = p*²(s+g)          (1.1)

    perr* = sqrt((γn + πerr·mn)/(s+g))   ⇒  γn + πerr·mn = perr*²(s+g)   (1.2)

    p - perr = (sqrt(γn + πmn) - sqrt(γn + πerr·mn))/sqrt(s+g)  ⇒

    p²(s+g) - perr²(s+g) = πmn - πerr·mn  ⇒

    2(s+g)(p - perr)(p + perr) = 2mn(π - πerr)  ⇒

    2(s+g)(p - perr) = 2mn(π - πerr)/(p + perr)                          (1.3)

    T = (γn + πmn)/p + (ρ+g)m + (s+g)p + s

    Terr = (γn + πmn)/perr + (ρ+g)m + (s+g)perr + s

    T - Terr = γn(1/p - 1/perr) + πmn/p - πerr·mn/perr + (s+g)(p - perr)

             = (s+g)(p - perr + p - perr)                                using (1.1) and (1.2)

             = 2(p - perr)(s+g)

             = 2mn(π - πerr)/(p + perr)                                  from (1.3)

Taylor series for π:

    ∂p/∂π = (1/2)·((γn + πmn)/(s+g))^(-1/2)·(mn/(s+g))

          = (1/(2p))·(mn/(s+g))                                          (1.6)

    ∂T/∂π = [p·mn - (γn + πφ·mn)·∂p/∂π]/p² + (s+g)·∂p/∂π

          = [p·mn - p²(s+g)·mn/(2p(s+g))]/p² + (s+g)·mn/(2p(s+g))        from (1.6)

          = [p·mn - p·mn/2]/p² + mn/(2p)

          = mn/p                                                         (1.4)

    ∂²T/∂π² = -(mn/p²)·∂p/∂π = -(mn/p²)·(mn/(2p(s+g)))                   from (1.6)

            = -(mn)²/(2p³(s+g))

            = -(mn)²/(2p(γn + πmn))                                      from (1.1)      (1.5)

    T - Terr = (∂T/∂π)(πerr)·(π - πerr) + (1/2)·(∂²T/∂π²)(πφ)·(π - πerr)²

        where πφ is some point between π and πerr

             = (mn/perr)·(π - πerr) - (1/4)·((mn)²/(pφ(γn + πφ·mn)))·(π - πerr)²   from (1.4) and (1.5)

Taylor series for p when π changes:

    ∂²p/∂π² = -(1/4)·(1/p³)·(mn)²/(s+g)² < 0                             (1.7)

    p - perr = (∂p/∂π)(πerr)·(π - πerr) + (1/2)·(∂²p/∂π²)(πφ)·(π - πerr)²

    Since the second-order term is negative by (1.7),

    p - perr < (∂p/∂π)(πerr)·(π - πerr) = (1/(2perr))·(mn/(s+g))·(π - πerr)   from (1.6) and (1.7)

Graphs below show the sensitivity of both p* and time per iteration to changes in s and to changes in π.

Table 5.7 shows what happens as the error in startup time (s) increases. The

correct s value is in the middle of the table. It has a 0% error. Both p* and

time per iteration are shown for each error in the last two columns. Figure 5.11 and

Figure 5.12 graph time per iteration and p*, respectively, for the percentage errors in

s. The correct s value is in the center of the horizontal axis at 0% error. As you move

to the right the error assigns s too high a speed. As you move to the left the error

assigns s too low a speed. Note that the time per iteration increases in whichever

direction we err and irrespective of whether p* increases or decreases. This is because

we are no longer using the optimum p*.

Table 5.8 shows what happens as the error in pivot time (π) increases. Figures 5.13 and 5.14 correspond to Table 5.8. The analysis of the last paragraph for s applies correspondingly to errors in π.

A problem of 1,000 by 10,000 was used for these tables.

Note that a 30% change in π gives a 10% error in time per iteration and a 40%

change in s gives a 10% error in time per iteration.



s           % error    p*        T
2.694E-04    -40%      67.75     0.0343
2.501E-04    -30%      70.29     0.0341
2.309E-04    -20%      73.14     0.0340
2.116E-04    -10%      76.37     0.0339
1.924E-04      0%      80.07     0.0338
1.732E-04     10%      84.37     0.0339
1.539E-04     20%      89.44     0.0340
1.347E-04     30%      95.55     0.0343
1.154E-04     40%     103.11     0.0348
9.620E-05     50%     112.80     0.0356

Table 5.7 - p* and T as a function of relative error in s (the correct s, at 0% error, is in the middle row)

[Figure: time per iteration T versus % error in s, -40% to +50%]

Figure 5.11 – Time per iteration as a function of relative error in s

[Figure: p* versus % error in s, -40% to +50%]

Figure 5.12 - p* as a function of relative error in s



π           % error    p*        T
1.740E-07    -40%     189.52     0.0737
1.616E-07    -30%     182.62     0.0734
1.492E-07    -20%     175.46     0.0731
1.367E-07    -10%     167.99     0.0730
1.243E-07      0%     160.17     0.0729
1.119E-07     10%     151.95     0.0730
9.945E-08     20%     143.26     0.0733
8.702E-08     30%     134.01     0.0739
7.459E-08     40%     124.07     0.0750
6.216E-08     50%     113.26     0.0767

Table 5.8 - p* and T as a function of relative error in π (the correct π, at 0% error, is in the middle row)

[Figure: time per iteration T versus % error in π, -40% to +50%]

Figure 5.13 - Time per iteration as a function of relative error in π

[Figure: p* versus % error in π, -40% to +50%]

Figure 5.14 – p* as a function of relative error in π



6. Implementation Choices

In order to implement the distributed algorithm, a communication package

was necessary. A communication package is basically a library with functions for

performing parallel operations. The packages we considered can be used with most

programming languages. There are also a number of dedicated parallel programming languages.

6.1 Distributed programming software

In order to implement our parallel method we examined a number of

distributed programming packages. We focused mainly on PVM, MPI [Geist, 1996],

and BSP [Goudreau et al, 1999]. The package to be chosen had to be able to run on a

network of workstations, not just a Massive Parallel Processor (MPP). Another

concern was the ease of use and the portability. Geist [1996] points strongly to PVM.

He claims that MPI has a rich set of functions for point-to-point and collective

communication, but it does not run well on heterogeneous networks. PVM, on the

other hand, is built with the "virtual machine" concept in mind. It should work in a

heterogeneous environment and could handle dynamic process creation. On the other

hand, PVM has greater overhead and will under-perform MPI on an MPP and even

on a homogeneous network of workstations. If there are many small messages, the

overhead is multiplied.

Another package is BSP (Bulk–Synchronous Parallel). BSP is a model for which implementations have also been coded. Our application might work well with it because it can be synchronized at certain points, which is a central feature of BSP. We ruled out BSP, however, because it is not widely used.



6.2 Sockets

On a lower level of communication are socket calls. The distributed

programming software that was just mentioned in fact makes use of socket function

calls. There are two categories of sockets. One of the categories, TCP sockets, has

built-in error checking. It employs a three-way handshake and ensures that packets are

received in order. This is the category that is used by the distributed programming

software. TCP sockets cannot take advantage of the Ethernet’s broadcast facility. The

other socket category is known as UDP. This category is not used by the packages but does allow the Ethernet’s broadcast facility to be used. More information on socket

programming can be found in [Comer and Stevens, 1996].
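For illustration, enabling UDP broadcast takes a single socket option; a minimal sketch (error handling omitted):

#include <sys/socket.h>
#include <netinet/in.h>

int open_broadcast_socket(void) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);    /* a UDP (datagram) socket */
    int on = 1;
    /* Without SO_BROADCAST the kernel rejects broadcast destinations;
       TCP sockets have no equivalent, which is why they cannot use the
       Ethernet broadcast facility. */
    setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on));
    return sock;
}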

6.3 Reasons for choice of both sockets and MPI

Our application does not dynamically allocate processes and it does send

many small messages in the column selection process. We assume it will be run on a

homogeneous UNIX network. This suggests MPI over the other distributed parallel

packages. MPI is also one of the standard packages used and was available to us. If our network had processors with different speeds, PVM would be a little better, although with either package load balancing would have to be handled by our program.

MPI is a communication protocol. It specifies communication functions and

what they must do. One implementation is called LAM (Local Area Multi-computer),

which is an MPI environment that allows multiple workstations to work together in

solving one problem [Ohio supercomputer center, 1995]. LAM/MPI is the

implementation that we used; it is given as a library add-on to the programming



language. It includes communication functions that the processors on the network can

use to communicate. Two of the functions we use are Allreduce and Bcast. These

were explained in Section 5.2.3 in the context of our method’s communication needs; their implementation in LAM is described in Section 6.4.

Unfortunately, we were not able to find an implementation of MPI that makes

effective use of the broadcast capabilities of Ethernets. This feature of Ethernets is

essential to the performance of our distributed algorithm.

UDP sockets, on the other hand, allow us to use the Ethernet’s broadcast

capability. This makes a major difference in the scalability of our program. Figures

5.6 and 5.7 show the difference in performance. UDP sockets can safely be used on

local Ethernets where the hardware should deliver the packets in order. The error

checking that is left out in UDP is not necessary on a local Ethernet [Comer and

Stevens, 1996 pg. 13]. MPI is still useful on the larger networks where UDP sockets

cannot be used. We use sockets for empirical testing since they can take advantage of

the Ethernet broadcast, even though both the sockets and the MPI versions were

implemented.

For this reason we decided to directly use socket functions in place of the two

MPI commands. It is interesting to note that Bruck et al [1995] have implemented a communication package that does take advantage of broadcast.

6.4 The specific commands used

Even though we used sockets, the MPI terminology is still useful. The simplex

method consists, primarily, of one loop. Section 2.1 described the serial simplex

tableau method and Sections 5.1-5.2 gave a sketch of the steps for the parallel tableau

method. It was pointed out that within that one loop there are two communications.

First there is a computation period followed by a communication period that collects the maximum bid over each processor’s columns and broadcasts both the overall maximum and the identity of the “winning” processor to all the processors - an

MPI_Allreduce in MPI terminology. After that there is computation by one processor

(row choice). In the second communication, the winning processor broadcasts the

winning column to the rest of the processors; an MPI_Bcast in MPI terminology.

After that there is computation by all the processors.
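In MPI form, the two communications of one iteration would look roughly as follows (a sketch; the variable names are illustrative, not dpLP's actual identifiers):

#include <mpi.h>

void iteration_communications(double my_bid, int my_rank,
                              double *pivot_col, int m, MPI_Comm comm) {
    struct { double bid; int rank; } in = { my_bid, my_rank }, out;

    /* Communication 1: every processor proposes its best local column bid;
       MPI_MAXLOC returns the maximum bid and the winner's rank. */
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, comm);

    /* (The winning processor now performs the row choice locally.) */

    /* Communication 2: the winner broadcasts its pivot column to everyone. */
    MPI_Bcast(pivot_col, m, MPI_DOUBLE, out.rank, comm);
}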

6.5 A brief description of MINOS, a revised simplex implementation

MINOS is an implementation of the revised simplex method developed at

Stanford University [Murtagh and Saunders, 1983-1998]. It takes as input linear and

nonlinear programs in the MPS format. Our experiments, detailed in Section 7, relied

on comparing our method with the revised method; we used MINOS as our

representative of the revised method. It is important to note that we are comparing our

algorithm with the revised method in general. It is difficult to directly compare it with

MINOS when implementations of the revised method vary widely based on heuristic

differences. This is explained in more detail in Section 7.2.3. MINOS is often used

for research, one reason being that its source code is available. In our experiments we used

version 5.5.

We will detail a few of the MINOS settings we used to help us compare

MINOS to our program. More comprehensive details of how to use MINOS can be

found in the MINOS user's guide [Murtagh and Saunders, 1998]. MINOS takes two

files as input: An MPS data file and a specification file. The specification file tells

MINOS the features and parameter values to use when solving the problem. MINOS

defaults to using a crash procedure to get an initial basis [MINOS User’s Guide

Chapter 3]. Our code does not. In order to make comparisons more direct we disabled

it in MINOS, using "CRASH OPTION 0" in the spec file. MINOS will now simply

choose all the slack variables as the basis. We also set "SCALE OPTION 0" so the

problem would not be scaled. This is important because our program and MINOS

have different scaling methods. Below is the MINOS specification file that we used.

BEGIN general
PARTIAL PRICE 1
SCALE OPTION 0
CRASH OPTION 0
MPS FILE 10
PRINT LEVEL 1
PRINT FREQUENCY 1
SUMMARY FREQUENCY 1
END general

MINOS uses partial pricing. As noted in Section 3.2, the revised method benefits from a large n to m ratio. Another way the revised method can reduce computation is by avoiding the pricing out of every column. Instead of getting the best value from every column one can simply choose from a subset of the columns. This is called partial pricing as opposed to full pricing. By default MINOS will use partial pricing when n is large: if n is at least 1000 or if n is at least 4 times m, partial pricing will be used [Murtagh

and Saunders, 1998 Ch. 3]. In order to make comparison more direct we disabled

partial pricing with the line "PARTIAL PRICE 1" so that all columns are searched to

find the entering column.



The revised method makes extensive use of reinversions. There are three

reasons for reinversions: a) numerical stability b) to support some degeneracy

procedures [Gill et al, 1987] and c) refactorization of matrices used in the revised

method [Chvátal, 1983]. A full tableau method would only do a reinversion for the first two reasons. Reinversions for the first two reasons are executed infrequently, whereas refactorizations are quite frequent. Our serial algorithm has a “refresh” procedure built in; it was usually disabled for the purposes of experimentation (but see Figure 7.9).



7. Computational Experiments

Test runs were performed on a number of synthetic problems with matrices of

varying sizes, aspect ratios and densities using our serial method, our parallel method

and MINOS. Section 7.1.1 discusses these synthetic problems. Tests were also

performed on problems in the Netlib library (see Section 7.1.2). These tests were also

used to validate the models and to compare our standard method with the revised

method.

7.1 Problems used for experimentation

For use in experimentation, we needed problems of specific sizes and

densities. At the same time we wanted to use more realistic problems. These objectives were accomplished by writing our own synthetic linear program generators and by also using the Netlib library [http://www.netlib.org/lp].

7.1.1 Synthetic linear programs

We developed three LP generators: generator, generator1, and generator2.

All of them take as input m = the number of rows, n = the number of columns, s = the density of the non-zero coefficients (0 < s ≤ 1), and seed = the seed for the random number generator; in addition the user specifies a file descriptor for the MPS output, and a problem name.

generator

All the constraints are of the less-than-or-equal type. Each constraint coefficient is, independently, positive with probability s and zero otherwise. The value of a non-zero coefficient is chosen uniformly between 0 and 1. The right hand side

coefficients are all 1. The objective coefficients are all -1, with the exception of those

corresponding to columns that, by chance, end up with all 0 coefficients. In this case

the corresponding objective coefficient is set to +1. This prevents unbounded

solutions. Thus excluding these zero columns, we seek to maximize the sum of the

variables. The initial solution determined by all the variables = 0 is feasible so that no

phase I procedure is necessary. There is no guarantee (except the law of large

numbers) that the actual density of the problem is exactly or even close to s. The

program does report the actual density. The output is an mps format file defining the

generated LP.
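The core of generator can be sketched as follows (our C99 illustration of the logic just described; the actual program additionally writes the MPS file and reports the realized density):

#include <stdlib.h>

static double urand(void) { return rand() / (RAND_MAX + 1.0); }

/* Sketch of generator: m rows, n columns, target density s, seed. */
void generate(int m, int n, double s, unsigned seed,
              double a[m][n], double b[m], double c[n])
{
    srand(seed);
    for (int j = 0; j < n; j++) {
        int nonzeros = 0;
        for (int i = 0; i < m; i++) {
            /* nonzero with probability s, value uniform on (0,1) */
            a[i][j] = (urand() < s) ? urand() : 0.0;
            if (a[i][j] != 0.0) nonzeros++;
        }
        /* objective coefficient -1; +1 for all-zero columns,
         * which prevents unbounded solutions */
        c[j] = (nonzeros > 0) ? -1.0 : 1.0;
    }
    for (int i = 0; i < m; i++)
        b[i] = 1.0;   /* every right hand side coefficient is 1 */
}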

A major problem with generated problem instances (synthetic problems) is

that there may be covert, underlying structure that makes the problem much more

special than problems that might appear in practice. This was recognized early on by

Kuhn and Quandt [1963]. They proved a theorem that applies to generator, which

gives an asymptotic estimate of the value of the objective. Luby and Nisan [1993]

also gave a fast parallel, approximate algorithm that applies to non-negative

problems; this also applies to the problems generated by generator. Thus there are

obviously special features of this class of problems, which make them easier.

Fortunately, this is revealed more by the number of iterations than the work per

iteration. Since our methods apply to savings within iterations, these considerations

should not affect our results.

generator1

The constraints are generated as in generator, and they are also all less than or

equal constraints. The objective coefficients are now generated randomly between -1

and -0.5. If the column has all zero coefficients in the constraints the sign is reversed.

The right hand side coefficients are also generated randomly, uniformly in the range

0.5 to 1. The Kuhn-Quandt Theorem no longer applies, but the Luby-Nisan Algorithm

does.

generator2

In this generator we again have less than or equal constraints. The non-zero

matrix elements are generated uniformly between -1 and 1. The objective coefficients

are generated randomly between -1 and 1. The variables are constrained to be

between -m and m. The constraint values are required to lie between -1 and 1. Again, setting all variables to 0 is feasible - no Phase 1 is required. Neither the Kuhn-Quandt Theorem nor the Luby-Nisan Algorithm applies to the results of this generator. Notice

that this (and only this) generator requires the RANGE feature of the solver. The

RANGE feature provides for upper and lower bounds of the constraints as well as the

variables.

Figure 7.1 shows the total time as density increases for the three generators.

Figure 7.2 shows the time per iteration as density increases for the three generators.

Note that the total running times vary widely across the three generators while the times per iteration are very close. This supports our view that the type of synthetic problem used affects total running time more than the time per iteration.

Figure 7.1 – Total time by generator vs. density



Figure 7.2 – Time per iteration by generator vs. density



7.1.2 Netlib Problems

Netlib contains a repository of difficult linear programming problems

[www.netlib.org/lp/data, 1996]. These problems are often used as benchmarks for

testing linear programming code. The Netlib problems are in general sparse. Table

7.1 contains a listing of the Netlib problems sorted by their density.



Table 7.1: Netlib problems sorted by density


Name Rows Cols Nonzeros density

FIT1D 25 1026 14430 56.26%
FIT2D 26 10500 138018 50.56%
KB2 44 41 291 16.13%
WOOD1P 245 2594 70216 11.05%
AFIRO 28 32 88 9.82%
SHARE2B 97 79 730 9.53%
ISRAEL 175 142 2358 9.49%
ADLITTLE 57 97 465 8.41%
BLEND 75 83 521 8.37%
BEACONFD 174 262 3476 7.62%
FORPLAN 162 421 4916 7.21%
GROW7 141 301 2633 6.20%
BOEING2 167 143 1339 5.61%
SC50A 51 48 131 5.35%
SCSD1 78 760 3148 5.31%
SC50B 51 48 119 4.86%
RECIPE 92 180 752 4.54%
SHARE1B 118 225 1182 4.45%
E226 224 282 2767 4.38%
BRANDY 221 249 2150 3.91%
STOCFOR1 118 111 474 3.62%
AGG 489 163 2541 3.19%
SCAGR7 130 140 553 3.04%
GROW15 301 645 5665 2.92%
AGG3 517 302 4531 2.90%
AGG2 517 302 4515 2.89%
BOEING1 351 384 3865 2.87%
SCSD6 148 1350 5666 2.84%
SC105 106 103 281 2.57%
STAIR 357 467 3857 2.31%
TUFF 334 587 4523 2.31%
LOTFI 154 308 1086 2.29%
VTP.BASE 199 203 914 2.26%
BORE3D 234 315 1525 2.07%
GROW22 441 946 8318 1.99%
DEGEN2 445 534 4449 1.87%
CAPRI 272 353 1786 1.86%
BANDM 306 472 2659 1.84%
SCFXM1 331 457 2612 1.73%
D6CUBE 416 6184 43888 1.71%
SCTAP1 301 480 2052 1.42%
FFFFF800 525 854 6235 1.39%
SC205 206 203 552 1.32%
PILOT4 411 1000 5145 1.25%
SCORPION 389 358 1708 1.23%
SCSD8 398 2750 11334 1.04%
FIT1P 628 1677 10894 1.03%
SHIP04L 403 2118 8450 0.99%
SHIP04S 403 1458 5810 0.99%
DEGEN3 1504 1818 26230 0.96%
SEBA 516 1028 4874 0.92%
ETAMACRO 401 688 2489 0.90%
FINNIS 498 614 2714 0.89%
SCFXM2 661 914 5229 0.87%
25FV47 822 1571 11127 0.86%
SCAGR25 472 500 2029 0.86%
PILOT 1442 3652 43220 0.82%
MAROS 847 1443 10006 0.82%
BNL1 644 1175 6129 0.81%
PILOT.JA 941 1988 14706 0.79%
STANDATA 360 1075 3038 0.79%
PILOT87 2031 4883 73804 0.74%
STANDGUB 362 1184 3147 0.73%
STANDMPS 468 1075 3686 0.73%
NESM 663 2923 13988 0.72%
SCRS8 491 1169 4029 0.70%
PEROLD 626 1376 6026 0.70%
PILOTNOV 976 2172 13129 0.62%
SCFXM3 991 1371 7846 0.58%
QAP8 913 1632 8304 0.56%
GFRD-PNC 617 1092 3467 0.51%
SHELL 537 1775 4900 0.51%
SHIP08L 779 4283 17085 0.51%
MAROS-R7 3137 9408 151120 0.51%
SHIP08S 779 2387 9501 0.51%
PILOT.WE 723 2789 9218 0.46%
CZPROB 930 3523 14173 0.43%
TRUSS 1001 8806 36642 0.42%
WOODW 1099 8405 37478 0.41%
SCTAP2 1091 1880 8124 0.40%
CYCLE 1904 2857 21322 0.39%
MODSZK1 688 1620 4158 0.37%
SIERRA 1228 2036 9252 0.37%
SHIP12L 1152 5427 21597 0.35%
SHIP12S 1152 2763 10941 0.34%
GANGES 1310 1681 7021 0.32%
D2Q06C 2172 5167 35674 0.32%
SCTAP3 1481 2480 10734 0.29%
GREENBEA 2393 5405 31499 0.24%
GREENBEB 2393 5405 31499 0.24%
STOCFOR2 2158 2031 9492 0.22%
BNL2 2325 3489 16124 0.20%
QAP12 3193 8856 44244 0.16%
FIT2P 3001 13525 60784 0.15%
80BAU3B 2263 9799 29063 0.13%
QAP15 6331 22275 110700 0.08%
DFL001 6072 12230 41873 0.06%
STOCFOR3 16676 15695 74004 0.03%




7.2 Validation of Performance Models

7.2.1 Computation verification

Computation in dpLP (Dantzig rule)

Section 5 provided running time projections for our serial and parallel

programs. We used our models to pick the optimal number of processors to use. We

then were able to compare the running times of both our parallel and serial algorithms

with the revised simplex algorithm and to characterize which types of problems our

methods are good for. This analysis shows the advantages of our dpLP parallel

program for all problem sizes. This was discussed in Section 5. The parallel

program’s expression had a computation part and a communication part. We also had

a separate computation expression for the steepest edge column choice rule. In this

section we validate those expressions by comparing the actual running times for

linear programs with our projections.

In order to verify the estimations given in Section 5, we first have to estimate

the coefficients of the terms in our expressions. These coefficients would vary with

the computing environment (machines and network). In Section 5 we gave three

expressions that can be verified in our environment. Two are for computation, one for

the standard column choice rule and one for the steepest edge rule. One is for

communication, which doesn’t depend on the column choice rule.

Similarly, for our serial full tableau program we have two expressions and for

the revised (MINOS) program we have one expression.

The coefficients required for the computation expressions are a) column

choice time per unit vector element (ucol), b) row choice time per unit vector element

(urow), and c) pivot time per unit matrix element (upiv). These constants are defined

in Section 5.5. The constant terms required for the communication expressions are s

and g. An explanation of these constants is also given in Section 5.5.

There are, in general, two methods that we employed to get the coefficients.

One is by directly estimating those coefficients. The second method applies linear

regression to actual runs of the linear programming code to estimate the values of the

coefficients. If the regression produces a tight fit we can be confident that the

coefficients are accurate and that the expression correctly estimates the running times

of the programs.

To directly estimate a coefficient we time the specific code segment

corresponding to that coefficient. We then divide the time by the variables that are

multiplied by it in the expression. For example, in order to find upiv we time the

pivot function call. We then divide it by mn since mn is multiplied by upiv in the

expression.
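A minimal sketch of this direct estimation follows; unit_cost is our illustrative helper, and for upiv the work function would wrap one call of the pivot routine with term equal to m*n:

#include <sys/time.h>

/* Time reps calls of a code segment, then divide out the model
 * term that multiplies its unit cost (m*n in the case of upiv). */
double unit_cost(void (*work)(void), int reps, double term)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int r = 0; r < reps; r++)
        work();                              /* e.g. one pivot */
    gettimeofday(&t1, NULL);
    double secs = (t1.tv_sec - t0.tv_sec)
                + (t1.tv_usec - t0.tv_usec) / 1e6;
    return secs / reps / term;               /* seconds per unit */
}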

To verify the expressions, we generated a series of dense problems. We used

these problems together with problems from the Netlib library. In order to verify the

parallel dpLP expressions, the problems were run in parallel using multiple processors.

The smallest number of processors used was 2 and the largest was 7.

Figure 7.3 plots time per iteration against mn for the standard column choice

rule. Figure 7.4 is a similar graph for the steepest edge rule and is explained later in

this section. In Figure 7.3 the coefficient upiv dominates, especially for large

problems. This is because the pivot step in fact takes about 95% of the computation

time. The other two coefficients can actually be left out of the expression. One can

see from the figure that as mn grows so does the running time. The points of the

actual running times almost completely lie on the projected value line. This verifies

that the running time is almost completely dependent on mn. Since mn is the pivot

term of the expression, it justifies our leaving out the other terms. In fact, regression

was only used to find the value of upiv. The other coefficients were not accurately estimated by regression, probably because upiv dominates the other coefficients.

Below are the values obtained using the direct timing of the 3 steps of an

iteration. These values come from Table 5.1 and 5.2 in Section 5.4. We also include

upiv as estimated using regression even though its value was not used in the formulas.

ucol_se is the unit cost for the steepest edge column choice rule. This value is

only used in the steepest edge verification further in this section.

ucol                3.73E-08
urow                1.65E-06
upiv (direct)       1.24E-07
upiv (regression)   1.27E-07
ucol_se             3.47E-07

Only the pivot coefficient and term of the expression are used to estimate the

timing. The other terms are negligible. The estimation was applied to a number of

problems in the Netlib library in addition to generated problems. The relative

percentage error of our estimate relative to the observed timing was calculated as 100 × (observation − estimate) / estimate.

The mean percentage relative error observed amongst these problems was

5.00%. It is important to note that most large problems had a relative percentage error

of less than one percent. Unfortunately a few of the smaller problems gave larger

errors, which pulled the average up. The maximum relative error was 19.25%. As we

explained, the pivot step takes the main bulk of the time; the time taken by the other steps is relatively insignificant. For small problems that assumption is weaker, which can cause a larger relative error.



Figure 7.3 – Iteration time vs. mn (classical column choice rule)



Figure 7.4 – Iteration time vs. mn + αmn (steepest edge column choice rule)

Serial program (retroLP) verification

The serial time expression is essentially the same as our parallel expression.

The only difference is that it uses only one processor. We used the same coefficients

obtained for the parallel program’s expression for the serial expression. We executed

the serial program for the same group of problems described above. We then took the

average error between the estimation and its actual running time. Our serial program

gave 15.34% and 7.73% for the maximum and average relative percentage errors

respectively. Again the small problems pulled up the mean relative percentage error.

Steepest Edge verification

The expression for the steepest edge rule is different from the computation

expression only in the column choice part. Instead of having to do m comparisons we

now must do at most mn multiplications; m multiplications for each of the eligible

columns (see Section 3.3). This could roughly double the number of significant

operations compared to using the standard column choice rule. Based on Table 5.2

this will actually cost, on our network, between three and four times the total

computation time per iteration compared to using the standard column choice rule.

This assumes that all the columns are eligible. In practice, however, many columns

are not eligible. In one empirical test we found that only 35% of the columns were

eligible; the other columns could be immediately eliminated. The cost of an iteration

is therefore upper bounded by twice the number of operations and between three and

four times the time cost of an iteration (on our network) when the standard column

choice rule is used. This upper bound is rarely reached. For the steepest edge column

choice rule, therefore, the ucol_se coefficient is also significant. Note that this coefficient is different from the ucol coefficient in the standard column choice rule discussion. ucol_se here represents the unit cost of doing the steepest edge column choice rule assuming all columns are looked at. The value of ucol_se was listed earlier in this section.

In order to accurately estimate the running time of the program when using steepest edge we must know the percentage of columns actually looked at during the run. This percentage would then be multiplied by ucol_se. It is not known before a problem is solved, and we therefore cannot accurately estimate the running time in advance. The expression can nevertheless be used to show the accuracy of the model, as in Figure 7.3. Furthermore, if we use the generic 35% figure mentioned above we do get a reasonably good estimate of the running time. Assuming we know the number of columns actually looked at during execution, 17.72% and 8.22% are the maximum and average relative percentage errors respectively.
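In other words, writing α for the fraction of columns actually examined, the per-iteration estimate being validated here has the form (our notation, restating the mn + αmn of Figure 7.4):

    T_iter ≈ upiv·mn + α·ucol_se·mn,

with the remaining terms negligible as before.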

A graph similar to the one shown for the standard column choice rule is

provided in Figure 7.4. The horizontal axis, as in Figure 7.3, is the problem size mn.

Figure 7.4 shows two lines. The top line corresponds to problems where the program

looks at every column within the steepest edge column choice rule. The bottom line

corresponds to a problem where the program looks at none of the columns within the

steepest edge column choice rule. In practice a percentage of the columns are looked

at as we just explained. Note that the actual running times per iteration fall in between

these projected lines. This verifies our steepest edge rule timing projection.

Graphs of total running times are provided in Section 7.3.

7.2.2 Communication verification (using regression for coefficients)

Communication time

We now discuss the following issues:



i) User time vs wall clock time

ii) Regression vs direct timing

iii) Network idle time

iv) Socket barrier commands vs. MPI_Barrier to separate wait_time and

communication time

v) Separation of communication time from wait time

vi) Timing with both 2 processors and with up to 7 processors.

As in computation, there are two general ways of estimating the

communication coefficients s and g. One is to use s and g calculated by

experimentation using measurement programs. The other is to use regression on

timings of the actual LP programs.

The timing for communication is the wall clock time. It is important to run

communication tests during network idle time to avoid time accruing from other

processes running.

Another issue is to make sure that the communication times are accurate for

more than two communicating processors. To this end we estimated s and g in the

context of broadcast and all reduce. This verified the accuracy of s and g even when

more than two processors were involved in the communication.

Communication time can be divided into two parts. First, before the first

message can be read, the reading processor might be waiting for the sender to finish

its computation. This is referred to as wait time. The second part is the actual

communication time. We initially divided the two by putting an MPI_Barrier

command before the timing of the communication within the program. The only need

for an explicit barrier command is for this particular timing test. This command

separates the effect of processors waiting for other processors from the

communication itself. The comparison would then be on the communication time

without the effects of wait time.

The MPI_Barrier command itself has significant overhead. It actually adds 7

ms to the wait time. We timed this by putting a number of barrier commands inside a

loop.

Instead of MPI’s barrier command, a sequence of read and send commands in

the socket interface was substituted. This surprisingly decreased not only the wait

time but also the communication time.

It is unclear why there was a decrease in the communication time. It might be because the MPI barrier releases the processors at slightly different times after they enter it. The new socket barrier method seems to take away most of the

overhead the MPI barrier had. For the problems tested, the wait time plus the

communication time were almost the same as the communication time that was

obtained when there was no barrier statement.
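One simple way such a barrier can be assembled from plain socket calls is a coordinator that collects one byte from every peer and then answers each of them. The sketch below is our illustration of the idea, not the exact sequence used in dpLP:

#include <unistd.h>
#include <sys/socket.h>

/* Centralized barrier over connected sockets.  On the coordinator,
 * fds[0..npeers-1] are the peer connections; on a peer, fds[0] is
 * the connection to the coordinator. */
void socket_barrier(int is_coordinator, int *fds, int npeers)
{
    char tok = 'B';
    if (is_coordinator) {
        for (int i = 0; i < npeers; i++)
            read(fds[i], &tok, 1);        /* collect arrivals  */
        for (int i = 0; i < npeers; i++)
            send(fds[i], &tok, 1, 0);     /* release everybody */
    } else {
        send(fds[0], &tok, 1, 0);         /* announce arrival  */
        read(fds[0], &tok, 1);            /* wait for release  */
    }
}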

Table 7.2 compares percentage errors in two groups of problems. The first is a

group of 24 problems each of which was executed using 2 processors. The wait time

was separated from the communication time by use of socket calls. The second group

used the same problems as the first group. This group, though, was executed once

using 2 processors and then with 3 processors… all the way up to 7 processors. This

gives a total of 144 runs. The table contains both the maximum and average relative

percentage error for these two problem groups. The rows correspond to s and g values

derived from different sources. The first row shows the s and g that result from direct

experimentation. The bottom two rows show the s and g that result from regression.

Table 7.3 lists the 24 problems that were used with their sizes and densities. The first

10 problems, with names beginning with “d” are synthetic problems generated by

generator (the first one). Note however, that for this experiment the densities are

irrelevant since we are not using the revised method.

Only communication time is compared (no computation or wait time); the socket barrier method was used (no MPI_Barrier).

                                    s          g          data1 (2 p)         data2 (2 to 7 p)
                                                          max %     avg %     max %     avg %
from experiments (s1, g1)           1.36E-04   1.61E-07   273.41%   146.97%   170.65%   77.47%
regression on 2 p (s3, g3)          1.92E-04   1.50E-06   31.66%    12.67%    33.28%    12.96%
regression on 2-7 p (s4, g4)        2.00E-04   9.11E-07   64.45%    24.94%    28.75%    4.70%

Table 7.2 - Percentage errors in both groups of problems



name m n density
d500x5000 500 5000 100%
d100x4000 100 4000 100%
d200x2000 200 2000 100%
d100x3000 100 3000 100%
d100x2000 100 2000 100%
d100x1000 100 1000 100%
d100x1000 100 1000 100%
d100x500 100 500 100%
d100x100 100 100 100%
d10x100 10 100 100%
share2b 96 79 9.53%
share1b 117 225 4.45%
stair 356 467 2.31%
ship04l 402 2118 0.99%
ship04s 402 1458 0.99%
standata 359 1075 0.79%
standmps 467 1075 0.73%
standgub 361 1184 0.73%
shell 536 1775 0.51%
ship08l 778 4283 0.51%
ship08s 778 2387 0.51%
woodw 1098 8405 0.41%
ship12l 1151 5427 0.35%
ship12s 1151 2763 0.34%

Table 7.3 - The 24 problems used

These tests were repeated on the large problems. (Four of the 24 problems

were excluded.) In this set of 20 problems one had 50,000 matrix elements, and the

other 19 had at least 100,000 elements. The results were virtually the same (within 0.5 of a percentage point) as when we had all 24 problems.

Wait time

Wait time is the time that processors spend at a synchronization point waiting

for other processors to finish computation. In general this time should be short if the

load is evenly distributed amongst the processors. This waiting time is actually a

function of the computation parameters m, n, and p. The longer the computation, the more two different processors may vary in their computation time. For the classical column choice rule, the large cost of pivoting is what causes most of the wait time; the column choice itself takes an insignificant amount of time. In the steepest edge

rule, both the cost of pivoting and column choice heavily contribute. Only one

processor does the row choice, which is why it does not contribute to wait time but is

instead considered computation time.

We timed many pivots on a constant size dense matrix. We found very small

random discrepancies in time between the pivots. Each pivot step does the same

number of calculations. Since the discrepancies were very small and random, we assume they come from something random within the computer. For each iteration the

processors must wait for the slowest pivot. This wait time of one iteration should be

equal to the maximum pivot time of the processors minus the minimum pivot time of

the processors. Sum this per-iteration wait time over all iterations. This sum should be

the total wait time.
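In symbols (our notation), with T_q(i) denoting the pivot time of processor q in iteration i and I the total number of iterations:

    total wait time ≈ Σ_{i=1..I} [ max_q T_q(i) − min_q T_q(i) ].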

In our small problems, the computation time is much larger than the

communication time. As a result, the wait time is greater than the communication

time. This should change as the optimal number of processors is reached. At that

point, communication will be approximately as costly as computation. The

communication time will then be much larger than the wait time.

7.2.3 Analysis of the revised program (MINOS) expression

The revised method has several variants. They all go through the same basic

steps that use the inverse of the basis or some functionally equivalent representation.

For a more detailed discussion see [Nash, 1996] and [Chvátal, 1983]. The basic steps

are as follows:

A. Pricing out the c (objective) vector.

B. Choosing an entering basic variable.



C. Constructing the entering column.

D. Choosing the pivot row.

E. Updating the inverse or its functional equivalent.

Steps A and C make use of the “basis inverse” while step E keeps the “basis

inverse” current. Step E is executed every iteration in the case of the explicit inverse. If the inverse is stored as a factorization made up of other simple matrices, it is executed only once every several iterations. These “simple” matrices correspond to

some triangular matrix decomposition of the basis matrix such as LU decomposition.

In the latter case, step E is known as a refactorization and can cost up to m³ depending on sparsity, as explained below. When an explicit inverse is maintained, step E has a cost of about m², where m is the rank of the basis.

A procedure similar to a refactorization, which we call refresh is sometimes

performed even in the explicit inverse for the sake of numerical accuracy. Refresh is

much more infrequent and is not discussed here.

A very big factor in the running time of the revised method is sparsity. There

are two types of sparsity. The first is the sparsity of the original data. The second is

the sparsity of the inverse or its equivalent. Fill-in is the term used when the “inverse”

representation starts accumulating nonzeros.

Steps A and C can always take advantage of the first type of sparsity. The

explicit inverse representation of the revised can only be expected to take advantage

of the first type of sparsity. This is because there is only one inverse and in general

the inverse of a matrix will be dense even for a sparse matrix. On the other hand,

there are many possible factorizations of a matrix. This allows a “smart” factorizing

construction to choose one with very little fill-in. This is implemented by heuristically

choosing pivot elements that result in a sparse factorization. This allows inverse

factorization methods to take advantage of both forms of sparsity.

The Markowitz ordering scheme used in MINOS is an example of this. Steps

A, C and E, in these schemes, can take advantage of the second type of sparsity. Eta

factorization and triangular factorization are two ways of factorizing the inverse.

MINOS uses triangular factorization. It adds Eta vectors for each pivot until the next

refactorization.

The MINOS User’s guide says [Murtagh and Saunders, 1998]: “MINOS

maintains a sparse LU factorization of basis matrix B using a Markowitz ordering

scheme and Bartels-Golub updates, as implemented in the LUSOL package of Gill et

al [1987]. For a description of the concepts involved see Reid [1976, 1982]. The basis

factorization is central to the efficient handling of sparse linear and nonlinear

constraints.” MINOS therefore takes advantage of both types of sparsity.

MINOS uses many heuristics to speed up computation. One heuristic is the

“LU density tolerance.” It changes the refactorization based on density. MINOS 5.5

defaults to refactorizing every 100 iterations. This can be changed. It is difficult to

come up with a performance model for MINOS that takes all the heuristics into

account.

We can make an expression for the revised method that would take the first

type of sparsity into account. In Section 5 the graphs and discussion assumed an

expression that uses the explicit inverse form of the revised method. This is not the

way MINOS implements the revised but it’s close to the upper bound when the

second kind of sparsity is assumed not to occur. The sparsity in the functional

equivalent of the basis inverse is unknown before solving the problem. The revised

expression can theoretically be verified by going into the MINOS source code and

calculating the fill-in that occurs every iteration, similar to what we did for the steepest edge expression in our full tableau program. Since MINOS is not our code, we did not do that.

We can, though, show a comparison of dpLP to MINOS as the problem

density rises and as the number of processors rises. This is shown in the next section.

7.2.4 Revised vs. retroLP and dpLP

Figure 7.5 corresponds to Figure 5.3 and Figure 7.6 corresponds to Figure 5.6

and 5.7. Note that Figure 5.7 uses more processors than we have. Tables 7.4 and 7.5

correspond to Figures 7.5 and 7.6 respectively.

We ran a problem of size 1,000 by 5,000. For Figure 7.5 and Table 7.4 we

used 5% density. We stopped the program after 500 iterations. For those few that did not run for that many iterations, we scaled the time taken by 500/(the number of pivots performed). This gives

the time it would take for 500 iterations. Only 3 problems needed this.

Figure 7.5 and Table 7.4 compare MINOS and retroLP over varying densities.

We can see that for this problem, at somewhere between 70% and 80% density,

retroLP overtakes MINOS in speed.

Figure 7.6 and Table 7.5 compare MINOS and dpLP on a problem with 5%

density. The number of processors is increased up to 7. The optimum value is in fact about 53 processors. MINOS takes 24.24 seconds, whereas dpLP's time on 7 processors comes down to 45.64 seconds. The computational model given in Section 5



predicts a time of 41.72 seconds for dpLP on 7 processors, a prediction within 10% of the observed time. The same model predicts a running time of 11.73 seconds if dpLP were run on the optimal number of processors. From the graph, we can also see the steady speedup as the number of processors increases; the speedup had not leveled out at 7

processors.

Figure 7.5 – Actual timing as a function of density

Figure 7.6 – Actual timing as a function of p



Density Revised-MINOS (secs.) Serial-retroLP (secs.)
5% 24.240 306.640
10% 43.630 306.640
20% 82.314 306.640
40% 148.213 306.640
50% 194.070 306.640
60% 242.720 306.640
70% 285.060 306.640
80% 323.440 306.640
90% 352.384 306.640
100% 397.718 306.640

Table 7.4 - Actual timing as a function of density

Processors (for dpLP) Revised-MINOS-5% (secs.) Parallel-dpLP (secs.)
1 24.240 306.640
2 24.240 155.747
3 24.240 108.617
4 24.240 77.479
5 24.240 65.574
6 24.240 53.287
7 24.240 45.638

Table 7.5 - Actual timing as a function of p



7.3 Total Time Comparisons

As noted in Sections 4.1.2 and 5.11, one of the advantages of using a full

tableau parallel algorithm is the ability to take advantage of more complicated column

choice rules. Figure 7.7, “retroLP vs. MINOS”, shows total running time as a function

of density for problems with m=500 and n=1,000. It shows retroLP with both the

Dantzig and steepest edge column choice rules. It also shows MINOS (revised

method). Figure 7.8, “Speedup Relative to MINOS (m=500, n=1,000)”, shows

MINOS time divided by retroLP time as a function of density for the same data. The

density at which the Dantzig column choice rule overtakes MINOS is around 70%. The density at which the steepest edge column choice rule overtakes MINOS is

between 2% and 5%. The points in both of these figures represent nine runs each, one

run for each of the three generators combined with three different seeds.

Figure 7.9, “Speedup Relative to MINOS (m=1,000, n=5,000)”, is similar to

Figure 7.8. It shows MINOS time divided by retroLP time as a function of density for

problems with m=1,000 and n=5,000. These runs executed a tableau reinversion once

every 5,000 iterations. This reinversion cost is very close to 20% extra time for the

Dantzig column choice rule and about 15% extra time for the steepest edge column

choice rule. This is why the Dantzig version doesn’t catch up with the revised in this

figure.

It should be noted that although partial pricing doesn’t help in retroLP for the

classical column choice rule, it would make a big difference in the steepest edge rule.

The timing in this section was done on the PC environment mentioned in

Section 5.3. The UNIX environment was used for all the other timing.

Figure 7.7 – retroLP vs. MINOS



Figure 7.8 – Speedup relative to MINOS (m=500, n=1,000)



Figure 7.9 – Speedup relative to MINOS (m=1,000, n=5,000)



8. Summary, Applications and Future Work

8.1 Summary

In conclusion, our method has made large linear programs more tractable. It is especially good for large dense problems, or when the optimal number of processors is available (even for problems that are not dense). By taking advantage of parallelism it can also divide up the extra load of alternate column choice rules, which reduce the number of pivot steps required to solve the problem.

We have

1) Implemented a good general-purpose simplex program, called retroLP, using

the full tableau method. This implementation runs both on UNIX machines

and on PCs

2) Extended it to work on distributed systems using both MPI and IP

programming. This is called dpLP.

3) Developed performance models for both computation and communication to

optimize the number of processors. Although our network allowed verification

for only an Ethernet broadcast model, we gave expressions for several

different communication models.

4) Determined at what density our method becomes more efficient than existing

revised simplex implementations

5) Analyzed the number of processors needed to make our parallel method more

efficient than existing revised simplex implementations even for sparse

problems

6) Analyzed when the other column choice rules, in particular the steepest edge

column choice rule, can help our parallel method achieve faster total running

times than existing revised simplex implementations. This efficiency is

achieved at lower densities and with fewer processors than when using the

classical column choice rule.

8.2 Applications with dense matrices

There are a number of applications that lead to dense linear programs. One is

data mining [Bradley, 1999] and text categorization using the method of [Bennett and

Mangasarian, 1992], [Bosch and Smith, 1998]. The idea is, given a collection of

different articles and a group of categories, to put each article into its proper category.

We can take a document and for a given category decide whether or not the document

is a member of that category. First, n keywords are chosen to help distinguish between

categories. The variables in the LP correspond to these words. For each article each

keyword is counted to get its frequency in that article. The vector of these frequencies

defines a point in n space, which corresponds to a row in the LP. For most groups of

words the resulting tableau will be sparse. If instead of using individual words we

aggregate groups of words, the problem will become smaller and denser. The

grouping is known as feature compression and extraction. One solves the resulting

dense linear program to find a hyperplane that separates the points in the category

from the points out of the category.

Digital Filter Design [Steiglitz, 1992] and De-noising of images [Mallat,

1999 pg. 419] give rise to other dense applications. LP Relaxations of Machine

Scheduling Problems [Uma, 2000] is another dense application. The idea is to



schedule a number of tasks in such a way as to minimize the total time elapsed. The

rows correspond to the points in time. The variables (columns) correspond to the

tasks.

Eckstein et al [1995] cite other dense applications such as dense master

problems sometimes generated by the Dantzig-Wolfe or Benders decomposition,

digital filter design, data analysis and classification, and financial planning.

8.3 Future work

A. Other kinds of bids:

a. To analyze whether using dpLP with other column choice rules such

as exterior pivoting [Eiselt and Sandblom, 1985, 1990] would decrease

overall program time. This would be an extension to our analysis of

the steepest edge method.

b. To analyze block pivoting. This is another column choice method. A

whole group of non-basic variables is chosen to enter the basis at once.

These variables correspond to a “block” of columns and pivot rows.

It would be interesting to see how this method would do in the context

of retroLP and dpLP [Padberg, 1995 pgs. 70-75].

B. Special structures. Many Linear programming problems have special

structures.

a. One example is the “stepwise block structure.” This has the

following form:

Maximize   c1x1 + c2x2 + c3x3 + c4x4 + ... + c(n-1)x(n-1) + cnxn
Subject to a11x1 + a12x2                          op b1
           a21x1 + a22x2                          op b2
                           a33x3 + a34x4          op b3
                           a43x3 + a44x4          op b4
                               ...
           a(m-1)(n-1)x(n-1) + a(m-1)nxn          op b(m-1)
               am(n-1)x(n-1) + amnxn              op bm

where op refers to any of the relations =, <= or >= and the variables can be bounded from above and below.

b. Using some dedicated processors for column generation.

It would be interesting to see how retroLP and dpLP could be specialized

for linear programs with special structures [Hadley, 1962].

C. To further analyze divvying up unequal numbers of columns to the

processors. There are three possible reasons.

a. To avoid network contention between processors. When each

processor offers its “bid” during column choice, it is possible for

multiple messages to be transmitted simultaneously if processors

finish their pivot and column choice at the same time. This might

actually slow down communication. One way around that would

be to make sure processors finish at slightly different times by

giving them unequal loads. This was discussed in Section 5.8.

b. Extensions to heterogeneous computing (processors) (load

sharing). Clearly if the processors are not similar in terms of speed



and memory we would try to even them out by giving them

different computation loads.

c. Heterogeneous communication as explained below.

D. Extensions to heterogeneous communications. These are two possible

enhancements for the case of using networks other than a simple Ethernet.

a. Load sharing to compensate for the extra delay caused by

communication from outside networks.

b. Using MPI, TCP sockets or UDP sockets with error checking.

E. Dynamic load balancing and fault tolerance. Figuring out how to deal with

varying processor availability due to congestion or failure. Looking into

schemes such as column duplication or regeneration.

F. Use of partial pricing for the steepest edge rule in the full tableau method.

Appendix A. Form of Linear Program input: LPB and MPS

In this appendix we describe:

A. The internal linear programming format, LPB, used by our programs.

B. The MPS format

Each of these is illustrated by an example.

A general linear program is of the form:

Maximize   z = cx
Subject to Ax op b
           lj ≤ xj ≤ uj,   j = 1, ..., n

where op refers to any of the relations =, <= or >=.

A is the constraint matrix. l and u are the lower and upper bounds respectively.

x is a vector of unknowns and b is a vector of the right hand side values.

lj and uj can be negatively or positively infinite. If both bounds of a particular variable are infinite, the variable is said to be “free.” If both bounds of a particular variable are the same (lj = uj), the variable is said to be “fixed.” If lj is not equal to uj and both are finite, the variable is said to be “bounded.” To be consistent with the C programming language we often denote vectors by "[ ]" and matrices by "[ ][ ]."

A.1 Preprocessing into LPB format

retroLP and dpLP both use the simplex method with bounded variables. They

require a two-dimensional array, two integers and three vectors as input: a[m+2][n+1], m, n, and the vectors nl[n], nu[n], ntype[n]. (a[ ][ ], nl[ ], nu[ ], m and n are actually passed via the structure given in Table A.1.)

The output of the simplex is in the a[ ][ ] matrix at the end. Another function

extracts that information and outputs it.

Much of the data is stored in the matrix a[ ][ ] which has m+2 rows and n+1

columns where n is the number of variables and m is the number of constraints. The

0th column holds the b vector and the 0th row holds the c (objective) vector. The

(m+1)th row stores the Phase 1 objective. Constraints in the matrix can be a mixture

of less than, greater than and equalities. Vectors nl[ ], bl[ ], nu[ ] and bu[ ] hold the

upper and lower bounds of the variables. nrange[ ] and brange[ ] are lists of flags

indicating whether a variable is currently in between its bounds, lower than a lower

bound, at its upper bound or at both bounds (for fixed variable only). The values of

nrange and brange are determined by the program and need not be input.

These data structures describing the linear programming problem are all

stored within the data structure given in Table A.1.



typedef struct
{
    char *name;         // name of problem (usually the file name) (100)
    long m;             // number of rows
    long n;             // number of columns
                        // (actually, the matrix is (m+2)x(n+1))
    long mm;            // index of the current objective row: m+1
                        // for Phase 1, 0 for Phase 2.
    double **a;         // points to the constraint matrix ((m+2)x(n+1))
    var_type *ntype;    // types of non-basic variables: fixed,
                        // lower bounded, upper bounded, both, free. (n+1)
    double *nl;         // lower bounds of non-basics (n)
    double *nu;         // upper bounds of non-basics (n)
    long *jnonbasic;    // indices of current non-basic variables
    var_range *nrange;  // non-basic values within, at, below,
                        // or above bounds.
    double *x;          // current value of non-basic variables (to
                        // implement EXPAND) (n+1)
    var_type *btype;    // types of basic variables: fixed, lower
                        // bounded, upper bounded, both, free. (n+1)
    double *bl;         // lower bounds for basic variables
    double *bu;         // upper bounds for basic variables
    long *ibasic;       // indices of current basic variables (m+1)
    var_range *brange;  // basic values within, at, below,
                        // or above bounds.
    double *b;          // current value of basic variables
} LP_state;

Table A.1 - Data structure for retroLP and dpLP

Here is an example of how an outside driver program would preprocess a

linear programming problem and fill the data structures just mentioned.

Maximize   z = 2x + 2y − 5z
Subject to 5x − 4y + 3z ≤ 4
           5x + 3y + 3z ≥ 2
           2x + 3y − 4z = 10
           2 ≤ y ≤ 10,   x, z ≥ 0

First add a slack, a surplus and an artificial variable to the constraints (this can be done implicitly).

Maximize   z = 2x + 2y − 5z
Subject to 5x − 4y + 3z + s1 = 4
           5x + 3y + 3z + s2 = 2
           2x + 3y − 4z + s3 = 10
           2 ≤ y ≤ 10,   x, z ≥ 0
           s1 ≥ 0,   s2 ≤ 0,   s3 = 0

Solve for si for all i.

Maximize   z = 2x + 2y − 5z
Subject to s1 = 4 − 5x + 4y − 3z
           s2 = 2 − 5x − 3y − 3z
           s3 = 10 − 2x − 3y + 4z
           2 ≤ y ≤ 10,   x, z ≥ 0
           s1 ≥ 0,   s2 ≤ 0,   s3 = 0

This last representation is called a “Dictionary.”

The A matrix is filled with the coefficients of the dictionary.

           0   2   2  −5
           4  −5   4  −3
A[ ][ ] =  2  −5  −3  −3
          10  −2  −3   4
           0   0   0   0

nl[ ] = 0     2    0
nu[ ] = INF   10   INF
bl[ ] = 0     −INF 0
bu[ ] = INF   0    0

nrange[ ] = L L L. Each entry is one of U, L, BOTH or FREE, where U = at its upper bound, L = at its lower bound, BOTH means it is a fixed variable, and FREE means it is a free variable (not bounded on either side).

There are 3 nonbasic variables, thus n = 3.

There are 3 basic variables (a slack, a surplus and an artificial variable) corresponding to the 3 constraints, thus m = 3.

The top (0th) row is the objective; the bottom (m+1)th row is a place for the Phase 1 objective. The first (0th) column holds the right hand side constants. The resulting matrix is m+2 by n+1.

The 0th column of a[ ][ ] must be all nonnegative.

For a minimization problem the objective must be negated.
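Putting the pieces together, a driver might fill LP_state (Table A.1) for this example roughly as follows. This is our sketch: only the fields shown are set, HUGE_VAL stands in for INF, the LP_state typedef of Table A.1 is assumed to be in scope, and allocation of the remaining fields is omitted.

#include <math.h>   /* HUGE_VAL stands in for INF */

/* Our sketch of a driver filling part of LP_state (Table A.1) for
 * the example above; remaining fields and error handling omitted. */
static double row0[] = {  0,  2,  2, -5 };  /* objective row (row 0) */
static double row1[] = {  4, -5,  4, -3 };  /* constraint 1 (s1)     */
static double row2[] = {  2, -5, -3, -3 };  /* constraint 2 (s2)     */
static double row3[] = { 10, -2, -3,  4 };  /* constraint 3 (s3)     */
static double row4[] = {  0,  0,  0,  0 };  /* Phase 1 row (m+1)     */
static double *rows[] = { row0, row1, row2, row3, row4 };

static double nl[] = { 0, 2, 0 };                 /* x, y, z lower */
static double nu[] = { HUGE_VAL, 10, HUGE_VAL };  /* x, y, z upper */

void fill_example(LP_state *lp)
{
    lp->m = 3;      /* constraints */
    lp->n = 3;      /* structural variables */
    lp->a = rows;   /* the (m+2) x (n+1) tableau */
    lp->nl = nl;
    lp->nu = nu;
}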

A.2 The MPS format

MPS is a standard format originally developed by IBM for describing linear and integer programs. More details on the MPS format can be found in Murtagh [1998]. It is the format currently supported by our programs. MPS has a fixed and a free format. The fixed format is the one used by MINOS and by our code; it is the one we will describe. Each row in the file has fields, which occupy the specific column locations given in Figure A.1.



Field 1: columns 2-3
Field 2: columns 5-12
Field 3: columns 15-22
Field 4: columns 25-36
Field 5: columns 40-47
Field 6: columns 50-61

Figure A.1 - Field positions in fixed-format MPS
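A reader of the fixed format only needs to slice each line at those column positions. For illustration (mps_field is our hypothetical helper, not a routine from retroLP):

#include <string.h>

/* Copy columns lo..hi (1-based, inclusive) of a fixed-format MPS
 * line into buf, padding short lines and trimming trailing blanks.
 * buf must have room for hi-lo+2 characters. */
static void mps_field(const char *line, int lo, int hi, char *buf)
{
    int n = hi - lo + 1, len = (int)strlen(line);
    for (int i = 0; i < n; i++)
        buf[i] = (lo - 1 + i < len) ? line[lo - 1 + i] : ' ';
    buf[n] = '\0';
    while (n > 0 && buf[n - 1] == ' ')
        buf[--n] = '\0';
}

/* Example: Field 2 occupies columns 5-12, so
 * mps_field(line, 5, 12, name) extracts a row or column name. */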

NAME, ROWS, COLUMNS, RHS, BOUNDS, RANGES and ENDATA are

keywords delimiting the different sections of the file. They all begin in column 1

of their respective rows. The row starting in column 1 with “NAME” can have an

8-character problem name in Field 2. Every row in the ROWS section has a one-

character keyword (N, E, L, G) in Field 1 followed by a row name in Field 2. The

character in Field 1 specifies the type of constraint that row will contain. There

are four possible row types. They are an objective (N), an equality (E), a less than

(L) or a greater than (G).

Each record in the COLUMNS section has a column name in Field 2. Fields 3 and 4 contain a row name-value combination: the value in Field 4 goes into the row and column given in Fields 3 and 2 respectively. Fields 5 and 6 may contain another row name-value pair, or may be left blank.

The RHS section consists of a right hand side (rhs) name in Field 2. Fields 3 and 4 contain a row name-value combination: just as in the COLUMNS section, the value in Field 4 goes into the row and rhs given in Fields 3 and 2 respectively. (The MPS format supports multiple right hand sides; our implementation allows only one.) Fields 5 and 6 may contain another row name-value pair, or may be left blank.

Every row in the BOUNDS section has a two-character keyword (UP, LO, FX, FR) in Field 1 followed by a bound set name in Field 2. (The MPS format supports multiple bound sets; our implementation allows only one.) The keyword in Field 1 specifies the type of bound placed on the variable (column name) given in Field 3. There are four possibilities: an upper bound (UP), a lower bound (LO), a fixed variable (FX) or a free variable (FR). Fields 3 and 4 contain a column name-value combination: the value in Field 4 becomes the bound for the variable named in Field 3, within the bound set named in Field 2.

As an example the following is an LP in MPS format given in a text file:

NAME          TESTPROB
ROWS
 N  COST
 E  EQ1
 E  EQ2
COLUMNS
    XONE      EQ1       1
    XTWO      EQ2       1
    XTHR      COST      -10        EQ1       -1
    XTHR      EQ2       -1
    XFOUR     COST      100        EQ1       1
    XFOUR     EQ2       1
RHS
    RHS1      EQ1       2          EQ2       3
BOUNDS
 UP BND1      XTHR      1
 UP BND1      XFOUR     1
 LO BND1      XFOUR     -10
ENDATA

There are 3 rows; the first is row "COST" which is an objective row denoted by

keyword N. The second and third are rows called "EQ1" and "EQ2" which are

equality rows denoted by keyword E. Another two keywords not in this file are G and

L for greater than and less than constraints. There are four columns with names

"XONE", "XTWO", "XTHR" and "XFOUR". On the right of the column name are

one or two row names with values indicating all the values for that column. Next the

right hand side (b vector) is given in the same way the columns were given. Finally

there are three bounds. Two upper bounds denoted by keyword UP and one lower

bound denoted by LO. There are another two types of variable bounds; FX for fixed

variable and FR for free variable. There is also another section called RANGES.

(BND1 is just the "name" given to the bound in case there is another set of bounds.

RHS1 is the same. Usually there is only one RHS and one set of bounds.) Our

implementation only looks at the first set of bounds or RHS’s if there are more than

one. All values not mentioned are assumed to be 0. The problem can be a max or min

although it is usually assumed to be min. If bounds are not given for a variable

(column) a lower bound of 0 and an upper bound of +INF are assumed.

The above MPS file corresponds to the following LP:

         XONE   XTWO   XTHR   XFOUR
COST :                 −10     100
EQ1  :     1           −1        1    = 2
EQ2  :            1    −1        1    = 3

XONE, XTWO ≥ 0,   0 ≤ XTHR ≤ 1,   −10 ≤ XFOUR ≤ 1

Further information on the MPS format can be found in [Murtagh, 1998] and at

ftp://softlib.cs.rice.edu/pub/miplib.

Appendix B. Program description: retroLP and dpLP

retroLP is a full scale implementation of the standard simplex method written

in C++ compatible C. It takes input in the MPS format and supports all the options for

linear programming implied by the format except that multiple runs are not yet

supported. That is, our implementation expects at most one set of right hand side

constants, range sets, and bounds, respectively.

Three column choice rules are supported: The classical rule of Dantzig, the

steepest edge rule, and the maximum change rule. The algorithm can be easily

extended to allow the same problems to use differing column choice rules in different

iterations.

To preserve numerical stability, our implementation uses full pivoting

reinversion. The same procedure can be used to support basis crashing and warm

restarts. We use the specification for MPS given in Murtagh and Saunders [1998].

retroLP uses the EXPAND degeneracy procedure of Gill et al [1989] to improve

numerical stability and to avoid stalling and degeneracy.

retroLP is effective for dense linear programs with moderate aspect ratio.

Such problems arise, for example, in digital filter design, image processing, curve

fitting, and pattern recognition. The program can start from any assignment of values

to the variables, within bounds or not. In particular, retroLP can be used in a hybrid

computation with an interior point method along the lines suggested by Bixby et al

[2000].

In the simplex routine there are three steps in each iteration.



1. column selection using a column choice rule: column()

2. row selection: row()

3. the pivot: pivot()

Within column() there are many different possible column choice rules, only one of which is usually used for a given run, although mixing them is possible.
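Schematically, each phase is therefore a loop of the following shape (a sketch with illustrative signatures; the real routines operate on the LP_state of Table A.1, and the full Phase 1/Phase 2 case analysis is spelled out below):

/* Illustrative prototypes; see Table A.1 for LP_state. */
int  column(LP_state *lp);          /* entering column kp, or -1 if none */
int  row(LP_state *lp, int kp);     /* pivot row ip; 0 = bound flip;     */
                                    /* -1 = no constraint violated       */
void pivot(LP_state *lp, int ip, int kp);

void simplex_phase(LP_state *lp)
{
    for (;;) {
        int kp = column(lp);        /* 1. column choice rule           */
        if (kp < 0) break;          /*    no eligible column: done     */
        int ip = row(lp, kp);       /* 2. row selection                */
        if (ip == -1) break;        /*    Phase 2: problem unbounded   */
        pivot(lp, ip, kp);          /* 3. pivot on element (ip, kp)    */
    }
}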

B.1 retroLP - the serial implementation

retroLP first preprocesses data that comes in the MPS format. This was

described in Appendix A. The main simplex routine fills up row m+1 with the Phase

1 objective. It then performs the following steps:

Phase 1

1) Do column selection. There are 3 possibilities:

a) If no column can be selected and the objective value is 0 Phase 1 is

over: do clean up and begin Phase 2.

b) If no column can be selected and the objective value is not 0 the LP is

inconsistent. Report this and stop.

c) If a column kp has been selected continue with the next step.

2) Do row selection. Select the row whose constraint is the first to be

violated. There are 2 possibilities:

a) The bound of variable kp is the first to be violated: ip is set to 0: let kp

go to its other bound.

b) Row i's constraint is the first to be violated: ip is set to i: go to step 3.

3) Do a pivot on element (ip,kp).

loop on Phase 1.

Phase 2 uses row 0 for the objective.

Phase 2

1) Do column selection.

a) If no column can be selected we are at the optimum and Phase 2 is

over.

b) If a column kp has been selected continue with the next step.

2) Do row selection. Select the row whose constraint is the first to be

violated. There are 3 possibilities:

a) The bound of variable kp is the first to be violated: ip is set to 0: let kp

go to its other bound

b) Row i's constraint is the first to be violated: ip is set to i: go to step 3.

c) No constraint is violated: ip is set to -1: the problem is unbounded.

Report this and stop.

3) Do a pivot on element (ip,kp).

loop on Phase 2.

B.2 dpLP – the parallel implementation

dpLP first preprocesses data that comes in the MPS format. This was described

in Appendix A. dpLP then divides the n columns into p groups. Each of the p

processors gets approximately n/p of the columns. Each processor stores its data in

the data structure given in Table A.1.



The main simplex routine fills up row m+1 with the Phase 1 objective for all

processors. It then performs the following steps:

Phase 1

1) Each processor does column selection on its columns; the global max is

calculated and sent to all the processors. There are 3 possibilities:

a) If no column can be selected and the objective value is 0 Phase 1 is over:

do clean up and begin Phase 2.

b) If no column can be selected and the objective value is not 0 the LP is

inconsistent. Report this (to all processors) and stop.

c) If a column kp has been selected continue with the next step.

2) The winning processor does row selection. It selects the row whose constraint

is the first to be violated. There are 2 possibilities:

a) The bound of variable kp is the first to be violated: ip is set to 0: let kp go

to its other bound. All the processors update their data.

b) Row i's constraint is the first to be violated: ip is set to i: go to step 3.

3) The pivot column of the processor with the global max (winning processor) is

broadcast to all the processors together with the pivot row. Do a pivot on

element (ip,kp). The processors do this alone using the identical copy of the

winning column.

loop on Phase 1.

In Phase 2 each processor will use its row 0 for the objective.

Phase 2

1) Each processor does column selection on its columns; the global max is

calculated and sent to all the processors

a) If no column can be selected we are at the optimum and Phase 2 is

over.

2) The winning processor does row selection. It selects the row whose

constraint is the first to be violated. There are 3 possibilities:

a) The bound of variable kp is the first to be violated: ip is set to 0: let kp

go to its other bound. All the processors update their data.

b) Row i's constraint is the first to be violated: ip is set to i: go to step 3.

c) No constraint is violated: ip is set to -1: the problem is unbounded.

Report this (to all processors) and stop.

3) The pivot column of the processor with the global max (winning

processor) is broadcast to all the processors together with the pivot row.

Do a pivot on element (ip,kp). The processors do this alone using the

identical copy of the winning column.

loop on Phase 2.

References

Bennett, K.P. and Olvi Mangasarian, "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets," Optimization Methods and Software vol. 1, 1992, pgs. 23-34.

Bixby, Robert E. and Alexander Martin, "Parallelizing the Dual Simplex Method," INFORMS Journal on Computing vol. 12 no. 1, Winter 2000, pgs. 45-56.

Bosch, Robert and Jason Smith, "Separating Hyperplanes and the Authorship of the Disputed Federalist Papers," American Mathematical Monthly, August-September 1998.

Bradley, P.S., Usama Fayyad and Olvi Mangasarian, "Mathematical Programming for Data Mining: Formulations and Challenges," INFORMS Journal on Computing vol. 11 no. 3, Summer 1999, pgs. 217-238.

Bradley, P.S. and Olvi Mangasarian, "Feature Selection via Concave Minimization and Support Vector Machines," Machine Learning: Proceedings of the Fifteenth International Conference (ICML '98), J. Shavlik editor, Morgan Kaufmann, San Francisco, California, 1998, pgs. 82-90.

Bruck, Jehoshua, Danny Dolev, Ching Ho, Rimon Orni and Ray Strong, "PCODE: An Efficient and Reliable Collective Communication Protocol for Unreliable Broadcast Domains," IEEE 9th International Parallel Processing Symposium (IPPS), April 1995, pgs. 130-139.

Chvátal, Vasek, Linear Programming, Freeman, 1983.

Comer, Douglas and David Stevens, Internetworking With TCP/IP Volume III: Client-Server Programming and Applications, BSD Socket Version, 2nd edition, Prentice Hall, 1996.

Culler, David, Richard Karp et al., "LogP: Towards a Realistic Model of Parallel Computation," ACM Symposium on Principles and Practice of Parallel Programming (PPOPP), May 1993.

D'Alessio, S., K. Murray, A. Kershenbaum and R. Schiaffino, "Category Levels in Hierarchical Text Categorization," Proceedings of the Third Conference on Empirical Methods in Natural Language Processing, June 1998.

Dongarra, Jack and Tom Dunigan, "Message-Passing Performance of Various Computers," University of Tennessee Technical Report CS-95-299, May 1996.

Dongarra, Jack and Francis Sullivan, "Guest Editors' Introduction: The Top Ten Algorithms," Computing in Science and Engineering, January/February 2000, pgs. 22-23.

Eckstein, Jonathan, I. Boduroglu, L. Polymenakos and D. Goldfarb, "Data-Parallel Implementations of Dense Simplex Methods on the Connection Machine CM-2," ORSA Journal on Computing (INFORMS) vol. 7 no. 4, Fall 1995, pgs. 402-416.

Eiselt, H.A. and C.L. Sandblom, "External Pivoting in the Simplex Algorithm," Statistica Neerlandica vol. 39 no. 4, 1985.

Eiselt, H.A. and C.L. Sandblom, "Experiments with External Pivoting," Computers & Operations Research vol. 17 no. 10, 1990, pgs. 325-332.

Forrest, John and Donald Goldfarb, "Steepest-Edge Simplex Algorithms for Linear Programming," Mathematical Programming vol. 57, 1992, pgs. 341-374.

Geist, G.A., J.A. Kohl and P.M. Papadopoulos, "PVM and MPI: A Comparison of Features," Calculateurs Paralleles vol. 8 no. 2, June 1996, pgs. 137-150.

Gill, P.E., W. Murray, M.A. Saunders and M.H. Wright, "Maintaining LU Factors of a General Sparse Matrix," Linear Algebra and its Applications vol. 88-89, 1987, pgs. 239-270.

Gill, P.E., W. Murray, M.A. Saunders and M.H. Wright, "A Practical Anti-Cycling Procedure for Linearly Constrained Optimization," Mathematical Programming vol. 45 no. 3, 1989, pgs. 437-474.

Goldfarb, D. and J.K. Reid, "A Practicable Steepest-Edge Simplex Algorithm," Mathematical Programming vol. 12, 1977, pgs. 361-371.

Goudreau, Mark, Kevin Lang, Satish Rao, Torsten Suel and Thanasis Tsantilas, "Portable and Efficient Parallel Computing Using the BSP Model," IEEE Transactions on Computers vol. 48 no. 7, 1999, pgs. 670-689.

Hadley, G., Linear Programming, Addison-Wesley, 1962.

Hall, J.A.J. and K.I.M. McKinnon, "ASYNPLEX, an Asynchronous Parallel Revised Simplex Algorithm," Technical Report MS95-050a, Department of Mathematics, University of Edinburgh, July 1997.

Hall, J.A.J. and K.I.M. McKinnon, "Update Procedures for the Parallel Revised Simplex Method," Technical Report MSR 92-13, Department of Mathematics, University of Edinburgh, September 1992.

Harris, Paula, "Pivot Selection Methods of the Devex LP Code," Mathematical Programming vol. 5, 1973, pgs. 1-28.

Karp, R. and V. Ramachandran, "A Survey of Parallel Algorithms for Shared Memory Machines," Handbook of Theoretical Computer Science (J. van Leeuwen, editor), North Holland, Amsterdam, 1990, pgs. 869-941.

Karp, Richard, A. Sahay, E. Santos and K. Schauser, "Optimal Broadcast and Summation in the LogP Model," Symposium on Parallel Algorithms and Architectures (SPAA), 1993, pgs. 142-153.

Kuhn, Harold and Richard Quandt, "An Experimental Study of the Simplex Method," Proceedings of Symposia in Applied Mathematics vol. XV, American Mathematical Society, Providence, RI, 1963.

Luby, Michael and Noam Nisan, "A Parallel Approximation Algorithm for Positive Linear Programming," Association for Computing Machinery (ACM), 1993, pgs. 448-457.

Maekawa, M., A.E. Oldehoeft and R.R. Oldehoeft, Operating Systems: Advanced Concepts, Benjamin Cummings, 1987.

Mallat, Stéphane, A Wavelet Tour of Signal Processing, 2nd ed., Academic Press, 1999, pg. 419.

Martel, Charles, "Maximum Finding on a Multiple Access Broadcast Network," Information Processing Letters vol. 52 no. 1, 1994, pgs. 7-13.

Murtagh, Bruce and M. Saunders, "MINOS 5.5 User's Guide," Technical Report SOL 83-20R, Stanford University, 1983-1998.

Nash, Stephen and Ariela Sofer, Linear and Nonlinear Programming, McGraw-Hill, 1996.

Ohio Supercomputer Center, MPI Primer / Developing with LAM, Ohio State University, 1995.

Padberg, Manfred, Linear Optimization and Extensions, Springer, 1995.

Reid, J.K., "Fortran Subroutines for Handling Sparse Linear Programming Bases," Report R8269, Atomic Energy Research Establishment, Harwell, England, 1976.

Reid, J.K., "A Sparsity-Exploiting Variant of the Bartels-Golub Decomposition for Linear Programming Bases," Mathematical Programming vol. 24, 1982, pgs. 55-69.

Snir, Marc, Steve Otto et al., MPI: The Complete Reference, MIT Press, 1996.

Steiglitz, Kenneth, T.W. Parks and J.F. Kaiser, "METEOR: A Constraint-Based FIR Filter Design Program," IEEE Transactions on Signal Processing vol. 40 no. 8, August 1992.

Stunkel, Craig and Daniel Reed, "Hypercube Implementation of the Simplex Algorithm," Association for Computing Machinery (ACM), 1988, pgs. 1473-1482.

Uma, R.N., "Theoretical and Experimental Perspectives on Hard Scheduling Problems," PhD Dissertation, Polytechnic University, July 2000.

Valiant, Leslie, "A Bridging Model for Parallel Computation," Communications of the ACM vol. 33 no. 8, 1990, pgs. 103-111.

Wolfe, Philip and Leola Cutler, "Experiments in Linear Programming," Recent Advances in Mathematical Programming, Graves and Wolfe eds., McGraw-Hill, New York, 1963.
