
A Distributed Implementation of the Simplex Method

DISSERTATION

Submitted in Partial Fulfillment of the Requirements for the degree of

DOCTOR OF PHILOSOPHY

(Computer & Information Science)

at the

POLYTECHNIC UNIVERSITY

by

Gavriel Yarmish

March 2001

Approved:

Department Head

Date

Copy No.__________

Copyright 2001 by

Gavriel Yarmish

All rights reserved.



Major: Computer Science


____________________
Richard Van Slyke

Professor of
Computer & Information
Science

___________
Date

____________________
Alex Delis

Associate Professor of
Computer & Information
Science

___________
Date

____________________
Torsten Suel

Assistant Professor of
Computer & Information
Science

___________
Date

Minor: Financial Engineering


____________________
Frederick Novomestky

Industry Associate
Professor of
Management

___________
Date

Microfilm or other copies of this dissertation are obtainable from:

UMI Dissertations Publishing

Bell & Howell Information and Learning

300 North Zeeb Road

P.O. Box 1346

Ann Arbor, Michigan 48106-1346



Gavriel Yarmish was born in the United States. He was awarded his BS in

Computer Science, Magna Cum Laude, from Touro College in 1991 and his MA in

Computer Science from Brooklyn College in 1993. He is a member of Tau Beta Pi,

the engineering honor society, and has taught Computer Science and Mathematics

since 1995. The research presented in this thesis was completed between 1994 and

2001.

This Dissertation is dedicated to my family, all of whom have lent

encouragement and support during the time spent on this research and before.

Acknowledgments

I wish to express my sincerest thanks to the chairman of my dissertation

committee, Dr. Richard Van Slyke, for working with me throughout this long

enterprise. I also thank the staff of the Computer Science Department at Polytechnic

University. Jeff Damens, our system administrator, helped install and troubleshoot

MINOS and MPI. He also responded quickly to all network-related problems.

Torsten Suel provided input regarding various communication models. I also wish to

thank Joel Wein for the use of his lab during the early stages of my research and

Boris Aronov for his help and concern throughout the progress of this study. R. N.

Uma helped to explain the setup of MPI in the computing labs. I wish also to

acknowledge my good friend Jacob Maltz for his help with UNIX shell scripting and

for other troubleshooting throughout the empirical studies. I wish also to

acknowledge my committee members, Alex Delis, Frederick Novomestky, and

Torsten Suel for taking the time to critique the dissertation.



Abstract

A Distributed Implementation of the Simplex Method

by

Gavriel Yarmish

Advisor: Richard Van Slyke

Submitted in Partial Fulfillment of the Requirements for the degree of

DOCTOR OF PHILOSOPHY

(Computer & Information Science)

March 2001

The Simplex Method, the most popular method for solving Linear Programs

(LPs), has two major variants. They are the revised method and the standard, or full

tableau method. Today, virtually all serious implementations are of the revised

method because it is more efficient for sparse LPs, which are the most common.

However, the full tableau method has advantages as well. First, the full

tableau can be very effective for dense problems. Second, a full tableau method can

easily and effectively be extended to a coarse grained distributed algorithm. While

dense problems are uncommon in general, they occur frequently in some important

applications such as digital filter design, text categorization, image processing and

relaxations of scheduling problems.

We implement two full tableau algorithms. The first, a serial implementation,

is effective for small to moderately sized dense problems. The second, a simple

extension of the first, is a distributed algorithm, which is effective for large problems

of all densities.

We developed performance models that predict running times per iteration for

the serial version of our method, the parallel version of our method and the revised

method for problems of different sizes, aspect ratios and densities. We also developed

methods for choosing the number of processors to optimize the tradeoff between

computation and communication in distributed computations. We tested our

algorithms on practical (Netlib) and synthetic problems.



Table of Contents
1. Introduction ------------------------------------------------------------------------------------ 1
2. Related Work ---------------------------------------------------------------------------------- 4
2.1 Interior point methods 4
2.2 Parallel implementations for the simplex method 4
3. Review of the Simplex Method ------------------------------------------------------------- 6
3.1 A short review of linear algebra 6
3.2 Definition of a Linear Program 7
3.3 Short description of the full tableau simplex method 10
3.4 Short description of the revised method 13
3.5 Running time comparison of the revised method and the full tableau
method 16
4. Motivation for a serial and distributed full tableau implementation -------------21
4.1 Why the method is applied to full tableau 21
4.1.1 No pricing out or column reconstruction 21
4.1.2 Alternate column choice rules can easily be used 21
4.1.3 The inverse does not have to be kept in each processor 23
4.1.4 Dense problems do not gain by use of the revised 23
4.2 Distributed computing 23
4.2.1 Why use distributed over MPP – no need for a supercomputer 23
4.2.2 Why coarse grain parallelism is used 24
5. Models and Analysis-------------------------------------------------------------------------25
5.1 Synchronized parallel pivots 25
5.2 Short sketch of one distributed pivot step 27
5.2.1 Choice of four models of parallel communication 28
5.2.2 Basic explanation of communication parameters 29
5.2.3 Analyzing basic communication operations 31
5.3 The Experimental Environment 34
5.3.1 Description of lab(s) 34
5.4 Estimates of computation parameters ucol, urow and upiv 34
5.5 Estimates of communication parameters s, g and L 36
5.6 The performance model 40
5.7 The optimum number of processors 41
5.8 The optimum number of processors with a new column division scheme 43
5.9 Estimated parallel running time for each communication model 45
5.10 Running time estimates of the revised (MINOS), serial (retroLP) and parallel
Full-tableau algorithms (dpLP) 46
5.11 Advantage of the Steepest Edge column choice rule 55
5.12 Memory requirements of revised and full tableau method 57
5.13 Asymptotic (computation/communication ratio change) analysis 59
5.14 Sensitivity to s, π and p 62
6. Implementation Choices --------------------------------------------------------------------72
6.1 Distributed programming software 72
6.2 Sockets 73
6.3 Reasons for choice of both sockets and MPI 73
6.4 The specific commands used 74

6.5 A brief description of MINOS, a revised simplex implementation 75


7. Computational Experiments---------------------------------------------------------------78
7.1 Problems used for experimentation 78
7.1.1 Synthetic linear programs 78
7.1.2 Netlib Problems 83
7.2 Validation of Performance Models 86
7.2.1 Computation verification 86
7.2.2 Communication verification (using regression for coefficients) 93
7.2.3 Analysis of the revised program (MINOS) expression 98
7.2.4 Revised vs. retroLP and dpLP 101
7.3 Total Time Comparisons 105
8. Summary, Applications and Future Work ------------------------------------------- 110
8.1 Summary 110
8.2 Applications with dense matrices 111
8.3 Future work 112
Appendix A. Form of Linear Program input: LPB and MPS---------------------- 115
A.1 Preprocessing into LPB format 115
A.2 The MPS format 119
Appendix B. Program description: retroLP and dpLP------------------------------ 123
B.1 retroLP - the serial implementation 124
B.2 dpLP – the parallel implementation 125
References ----------------------------------------------------------------------------------------- 128

Figures
Figure 3.1- Full tableau vs. the revised methods......................................................... 17
Figure 3.2-n at which the revised method overtakes the tableau method (m=1,000) . 20
Figure 5.1- time per iteration as n increases................................................................ 45
Figure 5.2 – Time per iteration (m=1,000).................................................................. 49
Figure 5.3-Time per iteration as a function of density................................................ 51
Figure 5.4-Time per iteration as a function of aspect ratio ......................................... 52
Figure 5.5- Time per iteration as a function of m........................................................ 52
Figure 5.6 – Time per iteration as a function of p....................................................... 53
Figure 5.7 – Time per iteration as a function of p....................................................... 54
Figure 5.8 – Memory (in Megabytes) ......................................................................... 59
Figure 5.9 – Asymptotic speedup when a unit computation = 10-7 . s and g move
together................................................................................................................ 61
Figure 5.10 - Asymptotic speedup time when s = 3.4*10-7 and g = 1.7*10-6 .
Computation is changing..................................................................................... 62
Figure 5.11 – Time per iteration as a function of relative error in s ........................... 70
Figure 5.12 - p* as a function of relative error in s ..................................................... 70
Figure 5.13 - Time per iteration as a function of relative error in π ........................... 71
Figure 5.14 – p* as a function of relative error in π.................................................... 71
Figure 7.1 – Total time by generator vs. density......................................................... 81
Figure 7.2 – Time per iteration by generator vs. density ............................................ 82
Figure 7.3 – Iteration time vs. mn (classical column choice rule) .............................. 90
Figure 7.4 - Iteration time vs. mn+αmn (steepest edge column choice rule) ............. 91
Figure 7.5-Actual timing as a function of Density.................................................... 103
Figure 7.6- Actual timing as a function of p ............................................................. 103
Figure 7.7 – retroLP vs. MINOS............................................................................... 107
Figure 7.8 – Speedup relative to MINOS (m=500, n=1,000).................................... 108
Figure 7.9 Speedup relative to MINOS (m=1,000, n=5,000).................................... 109
Figure A.1.................................................................................................................. 120

Tables
Table 5.1-Coefficient estimates................................................................................... 38
Table 5.2 – Coefficient estimates used in models....................................................... 38
Table 5.3- Expressions of the six models.................................................................... 39
Table 5.4 – p* and optimal time per iteration ............................................................. 39
Table 5.5 – Time per iteration for m=1,000 ................................................................ 46
Table 5.6- estimated running time per iteration .......................................................... 48
Table 5.7-p* and T as a function of relative error in s ................................................ 70
Table 5.8- p* and T as a function of relative error in s ............................................... 71
Table 7.1 - Netlib problems sorted by density ............................................................ 85
Table 7.2- percentage errors in both groups of problems. .......................................... 96
Table 7.3-The 24 problems used ................................................................................. 97
Table 7.4- Actual timing as a function of Density .................................................... 104
Table 7.5- Actual timing as a function of p............................................................... 104
Table A.1-Data structure for retroLP and dpLP........................................................ 117

1. Introduction

The simplex algorithm of linear programming has been cited as one of "10

algorithms with the greatest influence on the development and practice of science and

engineering in the 20th century," [Dongarra and Sullivan, 2000]; however, there are

few effective parallel or distributed implementations. By changing from today's

popular form of the simplex method, the "revised" form, to the earlier "standard"

form (full tableau) we have been able to implement an effective coarse grained

distributed algorithm which is a simple extension of the standard form of the simplex

method. We thus reexamine the original version, the full tableau method of the

simplex algorithm. Today, virtually all serious implementations are of the revised

method because it is much faster for sparse LPs, which are most common. However,

the full tableau method has advantages as well. First, the full tableau can be very

effective for dense problems. Second, as we have already mentioned, a full tableau

method can easily and effectively be extended to a coarse grained distributed

algorithm. The distributed computational environment we have in mind is p identical

dedicated workstations connected homogeneously by a broadcast network (Ethernet).

A natural approach to a distributed simplex method is to partition the columns

of the linear program among processors. This has several implications. All activities

performed on columns are divided among the processors and thus sped up essentially linearly. This in turn suggests the use

of the full tableau simplex method in place of the more standard revised simplex

method. In the tableau method no processor needs to keep a copy of the basis or its

inverse. Moreover, the work of updating the columns can be done in parallel. We

wrote a serial linear programming code based on the full tableau. Then, with the

addition of a few simple communication functions, a distributed version was

constructed.

The revised method is faster for sparse problems. For dense problems, though,

the full tableau method may perform better. This is because the revised method must

calculate data that the full tableau would already have. For a comparison of the

computation costs see Section 3.5.

Our method is good for dense problems even when using the serial program.

Our parallel program is good for large problems, sparse or not.

One issue that must be addressed is how to give each processor enough

computation so that the communication latency of a distributed system is

amortized. One way is to pick better columns by using alternate column choice rules.

These alternate rules cost more in computation than the standard rule. On the other

hand, these rules may reduce the number of iterations, see [Wolfe and Cutler, 1963],

[Kuhn and Quandt, 1963], [Goldfarb and Reid, 1977], [Forrest and Goldfarb, 1992]

and [Fletcher, 1998]. By pricing out columns in parallel, the extra cost can be

amortized over the processors.

We first develop performance models for communication and computation.

We then analyze the running time, including computation and communication. We

characterize the optimal number of processors to minimize running time as a function

of size, aspect ratios and density. We also see which column choice rule would be

effective in this method. The dissertation is organized as follows. Section 2 lists

related work both in parallel implementation of the simplex method and in other

methods of solving linear programs. Section 3 gives a review of the simplex method.

Section 4 adds to the motivation of our work. It explains in more detail the types of

problems for which this method is applicable and how, with this method, different

column choice rules can be employed. Section 5 explains in detail the communication

models and their analysis. It shows how well our method does in comparison to other

methods, including the revised simplex method, as represented by MINOS. Section 6

explains how we implemented our serial and parallel method. Section 7 details

experiments on actual problems using our implementation of the method. Section 8

offers conclusions and possibilities for future work. Two appendices follow; one

specifies the MPS input format our linear program packages accept and the other

describes both the serial and parallel linear programming packages.



2. Related Work

2.1 Interior point methods

The simplex algorithm is not the only way to solve a linear program. There

are other methods. The main competitors are a group of methods known as interior

point methods. Some interior point methods have polynomial worst case running

times, which are less than the exponential worst case running time of the simplex

method [Nash and Sofer, 1996 pgs. 269-278]. On average though, the simplex

method is competitive with these methods. With the simplex method, post-run

analysis is also possible [Nash and Sofer, 1996]. This dissertation focuses on the

simplex method.

2.2 Parallel implementations for the simplex method

Parallel implementations for general linear programs have been based on the

revised method [Hall and McKinnon, 1997] and the full tableau method [Eckstein et

al, 1995]. The parallel implementation of the full tableau used "stripe arrays" on the

Connection Machine CM-2. This implementation is fine grain and machine

dependent. The approach we use is coarse grained and can be applied to distributed

systems. Another implementation used a hypercube network [Stunkel and Reed,

1988]. Stunkel and Reed also used the full tableau form of the simplex method. They

actually compared two ways of partitioning the constraint matrix amongst the

processors on a hypercube. The first way is similar to our partitioning scheme where

groups of columns are partitioned and given to different processors. The second way

is to partition the rows.



A few of the differences between our work and the hypercube implementation

are

a. Our method is for Ethernets, which provides broadcast

communication.

b. We provided performance models, which allow us to determine the

optimal number of processors.

c. We considered alternative column choice rules such as the steepest

edge rule.

d. We compared our method to the revised method for dense and sparse

problems.

A fast parallel program that gives an approximate solution for a

special case of linear programs is described in [Luby and Nisan,

1993]. CPLEX has also implemented its simplex code on parallel

machines, based on the dual simplex method [Bixby and Martin, 2000].

3. Review of the Simplex Method

3.1 A short review of linear algebra

We first look at a system of linear equations in two representations:

Ax = b

or

$$\sum_{j=1}^{m+n} a_{i,j}\, x_j = b_i \qquad i = 1,\dots,m$$

We assume, for now, that A is m × (n+m) and of full rank. Thus there is a set

of m linearly independent columns of A, which we call a basis.

By renumbering columns we can collect these columns into a non-singular

square submatrix B and write A = [N|B].

We also write $x = \begin{bmatrix} x_N \\ x_B \end{bmatrix}$.

Then we have $Nx_N + Bx_B = b \Leftrightarrow Ax = b$.

Since B is non-singular:

$B^{-1}Ax = B^{-1}[N|B]x = B^{-1}Nx_N + B^{-1}Bx_B = B^{-1}b$, or

$N'x_N + Ix_B = b'$, or

$Ix_B = b' - N'x_N$, where $N' = B^{-1}N$ and $b' = B^{-1}b$.

In the other notation, $Ix_B = b' - N'x_N$ becomes:

$$x_{n+i} = b'_i - \sum_{j=1}^{n} a'_{ij}\, x_j \qquad (i = 1,2,\dots,m) \qquad (3.1)$$
7

The variables $x_B = [x_{n+1}, x_{n+2}, \dots, x_{n+m}]$ are basic (dependent), and the

variables $x_N = [x_1, x_2, \dots, x_n]$ are non-basic (independent).

For any assignment of non-basic variables, we use (3.1) to determine basic

variables in a solution.

Following Chvátal [1983], we call this a dictionary representation.

If we are given a dictionary (3.1) with assignments of the non-basic variables

satisfying their constraints, so that the resulting values for the basic variables using

(3.1) happen to satisfy their constraints, we say that the dictionary and the non-basic

assignments are feasible.
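To make this construction concrete, the following is a minimal numpy sketch (the data, basis choice and variable names are made up for illustration and are not taken from our codes) that forms $N' = B^{-1}N$ and $b' = B^{-1}b$ and evaluates (3.1):

```python
import numpy as np

# Illustrative data: A is m x (n+m) and the last m columns form the basis.
A = np.array([[1.0, 2.0, 0.0, 1.0, 0.0],
              [3.0, 1.0, 1.0, 0.0, 1.0]])
b = np.array([4.0, 6.0])
basic, nonbasic = [3, 4], [0, 1, 2]

B, N = A[:, basic], A[:, nonbasic]
# N' = B^{-1} N and b' = B^{-1} b; solve() avoids forming B^{-1} explicitly.
N_prime = np.linalg.solve(B, N)
b_prime = np.linalg.solve(B, b)

# (3.1): for any assignment of the non-basic variables x_N,
# the basic variables are x_B = b' - N' x_N.
x_N = np.zeros(len(nonbasic))
x_B = b_prime - N_prime @ x_N
print(x_B)   # here [4. 6.], since x_N = 0 and B is the identity
```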

This section (3.1) explained how a system of equations could be converted

into dictionary form. The following section continues the discussion with the addition

of an objective function.

3.2 Definition of a Linear Program

The Linear Programming terminology used here is similar to that used in

books detailing the full tableau method. One such book is [Chvátal, 1983].

We start with:

$$\begin{aligned}
\underset{x}{\text{Maximize}}\quad & z = cx \\
\text{Subject to}\quad & Ax = b \\
& l_j \le x_j \le u_j, \quad j = 1,\dots,n
\end{aligned}$$

As before we assume that we have a basis, B, to put this in dictionary form:

$$x_{n+i} = b'_i - \sum_{j=1}^{n} a'_{ij}\, x_j \qquad (i = 1,2,\dots,m)$$

We can also use this to eliminate all the basic variables in z = cx.

This leads us to the dictionary representation of Chvátal [1983].



$$\begin{aligned}
\underset{x}{\text{Maximize}}\quad & z = c'_0 + \sum_{j=1}^{n} c'_j x_j \\
\text{Subject to}\quad & x_{n+i} = b'_i - \sum_{j=1}^{n} a'_{ij}\, x_j \quad (i = 1,2,\dots,m) \qquad (3.2) \\
& l_j \le x_j \le u_j, \quad j = 1,\dots,m+n
\end{aligned}$$

Since this is the representation we will use from now on, we drop the primes

from the coefficients. The dictionary is said to be feasible for given values for $x_1,\dots,x_n$

if the given values satisfy their bounds and if the resulting values for $x_{n+1},\dots,x_{n+m}$

satisfy theirs.

Suppose our dictionary, besides being feasible, has the following optimality

properties:

(i) for every non-basic variable $x_j$ that is strictly below its upper bound we

have $c_j \le 0$, and

(ii) for every non-basic $x_j$ that is strictly above its lower bound we have $c_j \ge 0$.

Such a dictionary is said to be optimal. It is easy to see that no change in the

non-basic variables will increase z and hence the current solution is optimal.

In the next example we assume that $B = I \Rightarrow B^{-1} = I$. We take a linear

programming problem and convert it into an initial, possibly infeasible, dictionary. A

procedure called Phase I converts an infeasible dictionary into a feasible dictionary.

Details of Phase I are not discussed here. See [Nash and Sofer, 1996] or [Chvátal, 1983].

Given a problem of the form



$$\begin{aligned}
\text{Maximize}\quad & c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \\
\text{Subject to}\quad & a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n \;\;\mathrm{op}\;\; b_1 \\
& a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n \;\;\mathrm{op}\;\; b_2 \\
& \qquad\vdots \\
& a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n \;\;\mathrm{op}\;\; b_m
\end{aligned}$$

where op refers to any of the relations =, ≤ or ≥, and the variables can be bounded from above and below.

We first add slack, surplus and artificial variables to the constraints. This

results in the following form:

$$\begin{aligned}
\text{Maximize}\quad & c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \\
\text{Subject to}\quad & a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n + s_1 = b_1 \\
& a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n + s_2 = b_2 \\
& \qquad\vdots \\
& a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n + s_m = b_m
\end{aligned}$$

where slacks ≤ 0, surpluses ≥ 0 and artificials = 0.

We then solve for the s variables:

$$\begin{aligned}
\text{Maximize}\quad & z = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \\
\text{Subject to}\quad & s_1 = b_1 - a_{11} x_1 - a_{12} x_2 - \cdots - a_{1n} x_n \\
& s_2 = b_2 - a_{21} x_1 - a_{22} x_2 - \cdots - a_{2n} x_n \\
& \qquad\vdots \\
& s_m = b_m - a_{m1} x_1 - a_{m2} x_2 - \cdots - a_{mn} x_n
\end{aligned}$$

where slacks ≤ 0, surpluses ≥ 0 and artificials = 0.

which gives us a “Dictionary” [Chvátal, 1983].

The "a" values can be put into a tableau (matrix). Appendix A gives an

example of this transformation.
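As a small illustration of this transformation (made-up data; the sign conventions for slack, surplus and artificial variables are ignored here), one s variable is added per row and the system is solved for the s variables:

```python
import numpy as np

# Constraints 2x1 + x2 op 8 and x1 + 3x2 op 9, objective 3x1 + 2x2.
c = np.array([3.0, 2.0])
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([8.0, 9.0])

# After adding one s variable per row (a_i x + s_i = b_i) and solving
# for the s variables, each dictionary row reads s_i = b_i - sum_j a_ij x_j.
x = np.zeros(2)
s = b - A @ x
print(s)   # [8. 9.] -- with all x_j = 0 the s variables take the value b
```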

This and the previous section (3.1 and 3.2) explained how to put a linear

program into dictionary form. No pivots were necessary. The following section

explains the simplex method optimization process, which employs pivoting.



3.3 Short description of the full tableau simplex method

The simplex method operates by performing iterations on a feasible

dictionary. Each pivot "increases" the objective until the solution is shown to be

optimal, or the problem is shown to be unbounded or infeasible. An earlier implementation of the full

tableau simplex method is provided in [Press, 1992].

In order to get an initial feasible solution, another linear programming

problem is solved. This linear program uses the same procedure as explained here. It

uses a different objective on the same dictionary. This first linear program is known

as Phase I. The linear program we need to solve, which uses the real objective, is

known as Phase II. More details on Phase I can be found in linear programming

books, for example [Chvátal, 1983].

Within each iteration there are three steps:

1) Choose a non-basic variable (column choice):

Select a non-basic variable $x_j$, with its cost coefficient $c_j > 0$, in equation

(3.2) that is not at its upper bound, or one with its cost coefficient $c_j < 0$ and not

at its lower bound. There may be many such eligible choices. For now, any

will work. See the discussion below for possible alternatives. If no eligible

variable exists (the largest absolute value is 0), the solution is optimal and we exit.

2) Determine which variable becomes non-basic (row choice):

As we modify the non-basic variable in the direction determined in step 1)

three possibilities exist:

i) A bound of the non-basic variable of the winning column is the first to be

violated. Let the variable go to this other bound.



ii) The basic variable in row i is the first to violate its bound. Pivot using the

violated row.

iii) No constraint is violated. The problem is unbounded.

3) Perform a pivot or move a non-basic variable from one bound to its other

bound. (Steps 2 and 3 are sketched in code below.)
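The following is a condensed Python sketch of steps 2 and 3, for the simplified case in which every variable has lower bound 0 and no finite upper bound (so case i of step 2 cannot occur); it is illustrative only and is not the retroLP code:

```python
import numpy as np

def ratio_test_and_pivot(T, col):
    """Steps 2 and 3 on a full tableau T (simplified sketch).

    Row 0 of T is the objective row and the last column is the
    right-hand side; all variables are assumed to have lower bound 0
    and no finite upper bound.
    """
    a = T[1:, col]                       # entering column, constraint rows
    rhs = T[1:, -1]
    pos = a > 1e-12
    if not pos.any():
        raise ValueError("problem is unbounded")     # case iii
    ratios = np.full(a.shape, np.inf)
    ratios[pos] = rhs[pos] / a[pos]
    row = 1 + int(np.argmin(ratios))     # case ii: first bound violated
    # Step 3: pivot on T[row, col] over the whole (m+1) x (n+1) tableau.
    T[row, :] /= T[row, col]
    for i in range(T.shape[0]):
        if i != row:
            T[i, :] -= T[i, col] * T[row, :]
    return row
```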

The simplex just described uses the classical (Dantzig) column choice rule in

step 1. This is the most commonly used column choice rule. Other column choice

rules are possible.

[Wolfe and Cutler, 1963] and [Kuhn and Quandt, 1963] were early studies of

column choice rules. These studies evaluated how different column choice rules

affect running time. The issue was the tradeoff between using relatively complex

(slow) choice rules to reduce the number of iterations and the resulting increased

running time per iteration. In addition to the classic rule introduced by Dantzig the

“greatest change rule" and the "steepest edge rule" were among those tested. The

results were that while the more complex column choice rules would decrease the

number of iterations, the cost of applying those rules was too great for the problems

they studied. The extra computation required took away any speed-up gained by the

reduction in pivot iterations. They performed the tests using the full tableau method.

Today the revised method is more common.

Harris [1973], Goldfarb and Reid [1977], Forrest and Goldfarb [1992] and

Fletcher [1998] studied how to implement the Steepest Edge rule efficiently in the

revised method. They stored extra information in order to keep a recurrence formula

updated. They report an overall gain in computation speed when using the steepest

edge rule instead of the classic rule. Another column choice rule that is only feasible

to implement in the full tableau method is the “Greatest Change Rule." These three

general classes of column choice rules were implemented in our codes and are

explained in more detail below.

Classical (Dantzig): Take the eligible column with maximum $|c'_j|$; its

complexity is basically a constant number of additions/subtractions per

column, or order n for the entire process. In the full tableau method the current

objective row is readily available.

Steepest Edge: For each eligible column, divide $|c'_j|$ by the length of the

column of A, $\sqrt{\sum_i a'^{\,2}_{i,j}}$, and take the largest of these quotients. For increased

efficiency, one actually considers $c'^{\,2}_j \big/ \sum_i a'^{\,2}_{i,j}$. The complexity here is m+1

multiplication/divisions, and a similar number of additions/subtractions, worst

case, per column evaluated. It is important to note that this calculation is only

necessary for "eligible" non-basic variables, candidates that would increase the

objective value if brought into the basis. This is discussed further in Section

7.2.1.

Greatest Change: For each eligible column, actually compute how much the

objective function would improve if the column were introduced, and select

the one which would cause the greatest change. The complexity here is

essentially that of the row choice rule, order m multiplications/divisions and



additions/subtractions per column evaluated. Again, the full calculation is

only needed for eligible columns.
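As an illustration, the three scores can be computed for a single eligible column j as follows (a simplified sketch: c_row is the current objective row, T the current constraint rows, rhs the current right-hand side, and lower bounds of 0 are assumed for the greatest change rule):

```python
import numpy as np

def dantzig_score(c_row, j):
    # Classical rule: |c'_j|, read directly from the objective row.
    return abs(c_row[j])

def steepest_edge_score(c_row, T, j):
    # c'_j^2 / sum_i a'_{ij}^2  (squared form, avoiding the square root).
    col = T[:, j]
    return c_row[j] ** 2 / (col @ col)

def greatest_change_score(c_row, T, rhs, j):
    # Objective improvement if column j enters: |c'_j| times the step
    # length determined by the ratio test (lower bounds of 0 assumed).
    col = T[:, j]
    pos = col > 1e-12
    if not pos.any():
        return np.inf                    # unbounded direction
    step = (rhs[pos] / col[pos]).min()
    return abs(c_row[j]) * step
```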

The greater amount of work for the steepest edge and greatest change rules is

more than made up for by the reduction in the number of iterations using the

standard method. However, the greatest change rule is very hard to use in the

revised method. The steepest edge is moderately difficult to implement in the revised

method; special techniques are required [Goldfarb and Reid, 1977], [Forrest and

Goldfarb, 1992].

The steepest edge or greatest change rule is definitely worthwhile in our

distributed implementation (see Section 4.1.2).

The full tableau (matrix) method, unlike the revised simplex method, stores

and pivots on the whole tableau (matrix) of m rows and n columns. The full tableau

method uses information from the top row, which is the objective vector, for the

standard column choice rule. If other column choice rules are used additional

information is needed from the rest of the tableau.

3.4 Short description of the revised method

The Revised Simplex Method is commonly used for solving linear programs.

This method operates on a data structure that is roughly of size m by m instead of the

whole tableau. This is a computational gain over the full tableau method, especially in

sparse systems (where the matrix has many zero entries) and/or in problems with

many more columns than rows. On the other hand, the revised method requires extra

computation to generate necessary elements of the tableau: the current cost



coefficients and the entering pivot column for the column choice and the row choice

respectively. These are computational costs the full tableau method doesn't have.

In the full tableau method we started with:

$$\begin{aligned}
\underset{x}{\text{Maximize}}\quad & z = cx \\
\text{Subject to}\quad & Ax = b \\
& l_j \le x_j \le u_j, \quad j = 1,\dots,m+n
\end{aligned}$$

then using a basis B, and its inverse, we obtained a dictionary form:

$$\begin{aligned}
\underset{x}{\text{Maximize}}\quad & z = c'_0 + \sum_{j=1}^{n} c'_j x_j \\
\text{Subject to}\quad & x_{n+i} = b'_i - \sum_{j=1}^{n} a'_{ij}\, x_j \quad (i = 1,2,\dots,m) \\
& l_j \le x_j \le u_j, \quad j = 1,\dots,m+n
\end{aligned}$$

In the revised method, the second form is represented implicitly in terms of

the original system together with a functional equivalent of the inverse of the basis B

rather than explicitly in the dictionary form. "Functional equivalent" means we have a

data structure which makes solving $\pi B = c_B$ for $\pi$ and $BA'_j = A_j$ for $A'_j$ easy. $A_j$

represents the jth column of the A matrix and $c_B$ represents the basic objective

coefficients. The data structure need not be $B^{-1}$ or even necessarily a representation of

it. For example, an LU decomposition of the basis is often used, see [Nash and Sofer,

1996 pgs. 218-222]; another is to represent the inverse as a product of simple pivot

matrices; see [Nash and Sofer, 1996] and [Chvátal, 1983 Ch. 7]. Given the implicit

representation, we recreate the data needed to implement the three parts of the

simplex iteration. Thus, to "pivot" we update b' and our functional equivalent of

$B^{-1}$. We now sketch the steps of the revised simplex method.

Select Column:

We must have the coefficients c'j available. We use multiples of the

original constraints to eliminate the basic variables in the expression for z.

Symbolically, we let the component row vector π represent the multiples; that

is, we multiply constraint i by πi and subtract the result from the expression

for z. To make this work π must have πB = cB where cB, as above, represents

the m elements of c corresponding to the basis columns. Then c'j in the

dictionary is given by c' j = c j − π A j where Aj represents the jth column of

A.

This computation is called pricing. So to select the column using the

classical Dantzig rule, the vector π must be calculated and then the inner

product of π with each column of A must be subtracted from the original

coefficient.

The revised method takes more effort than the standard simplex

method in this step. However, for sparse matrices, pricing out is speeded up

because many of the products have zero factors. Moreover, the revised

simplex method can be speeded up considerably by using partial pricing

[Nash and Sofer, 1996 p. 222]. Partial pricing is a heuristic of not considering

all the columns during the column choice step. On the other hand, alternate

column choice rules such as steepest edge (however see [Forrest and

Goldfarb, 1992]) and greatest change are much more difficult to implement

using the revised approach.

Select Row:

To implement this we need b' and the column from the dictionary that

we chose in Step 1, $A'_j = (a'_{1j}, a'_{2j}, \dots, a'_{mj})^T$. The b' vector will be updated

from iteration to iteration; it does not need to be recreated. $A'_j$ is given by

solving $BA'_j = A_j$.

Here, sparsity in A pays off again. In the standard simplex method, as

we go from dictionary to dictionary we quickly lose whatever sparsity there

was in the original matrix A. In the revised method since we always go back

to the original matrix, we still have the original sparsity. Specifically, we have

the original sparsity in Aj.

Pivot:

If we need to pivot, instead of explicitly pivoting in a dictionary as before,

we update our functional equivalent of $B^{-1}$.
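The following sketch puts the three steps together for the explicit-inverse case. It is illustrative only: bounds and degeneracy are ignored, and production codes such as MINOS keep an LU factorization or product form rather than $B^{-1}$ itself.

```python
import numpy as np

def revised_iteration(A, b, c, Binv, basic, nonbasic):
    """One revised-simplex iteration with an explicit inverse (sketch)."""
    pi = c[basic] @ Binv                      # solve pi B = c_B
    # Pricing: reduced costs c'_j = c_j - pi A_j for non-basic columns.
    reduced = c[nonbasic] - pi @ A[:, nonbasic]
    k = int(np.argmax(reduced))               # classical (Dantzig) rule
    if reduced[k] <= 1e-9:
        return None                           # optimal
    j = nonbasic[k]
    col = Binv @ A[:, j]                      # recreate entering column A'_j
    b_cur = Binv @ b
    pos = col > 1e-12
    if not pos.any():
        raise ValueError("unbounded")
    ratios = np.where(pos, b_cur / np.where(pos, col, 1.0), np.inf)
    r = int(np.argmin(ratios))                # leaving row
    # "Pivot": update the functional equivalent of B^{-1} by an eta matrix.
    E = np.eye(len(b))
    E[:, r] = -col / col[r]
    E[r, r] = 1.0 / col[r]
    Binv = E @ Binv
    basic[r], nonbasic[k] = j, basic[r]
    return Binv, basic, nonbasic
```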

3.5 Running time comparison of the revised method and the full tableau

method

The revised simplex method is especially efficient for linear programs that are

sparse and have high aspect ratio (n/m). A linear program is sparse if most of the

elements of the dictionary are 0, and it has a high aspect ratio if n/m is large.

Updating any of the representations used by the revised method is usually, at worst,

of order m². On the other hand, pivoting on the explicit representation of the

dictionary takes order mn. Thus for high aspect ratios the standard method takes much

more work. Fortunately, in our distributed method, this work is done in parallel with

linear speedup in a straightforward way. This will be made clearer as we derive

performance models for the revised and full tableau methods later in this section.

Figure 3.1 compares advantages and disadvantages of the full tableau vs. the

revised methods.

Revised Simplex Method                           | Standard Simplex Method
-------------------------------------------------|-------------------------------------------------
Takes better advantage of sparsity in problems   | Is more effective for dense problems
Is more efficient for problems with large        | Is more efficient for problems with low
aspect ratio (n/m)                               | aspect ratio
Can effectively use partial pricing              | Can easily use steepest edge or greatest change
                                                 | pricing in addition to the classic choice rule
Is difficult to perform efficiently in parallel, | Very easy to convert to a distributed version
especially in loosely coupled systems            | using a loosely coupled system
Frequently, the functional equivalent of the     | Rarely, the dictionary has to be recomputed
basis inverse is recomputed, both for numerical  | from the original data to maintain numerical
stability and for efficiency (e.g., maintaining  | stability (but not for efficiency). The work
sparsity). The work is modest.                   | is substantial.
Figure 3.1- Full tableau vs. the revised methods

We will now give expressions to estimate the time per iteration of the revised

and full tableau methods. For simplicity we will count only multiplication and

division operations. The full tableau method using the classical column choice rule

requires m operations for the ratio test and (m+1)(n+1) for the pivot. Column Choice

requires n comparisons (insignificant cost) for the classical column choice rule and

(m+1)n (multiply/divides) for the steepest edge rule. Initially, we consider the

classical column choice rule and will therefore ignore column choice cost. This totals

mn+2m+n+1 operations per pivot iteration.

We can safely choose the iteration time estimate based on the explicit inverse

as an upper bound on the true value, since the more exotic functional equivalents of

the basis inverse would not be used if the performance were not better.

The revised method using the explicit inverse requires m² to determine π, mn

floating point multiplication/division and addition/subtraction operations to price out

the current objective row, m² to compute the entering column, m for the ratio test and

(m+1)² for the pivot. This totals mn+3m²+3m+1. This assumes an explicit inverse;

this can be reduced significantly for sparse problems or if we use a more effective

functional equivalent of the basis inverse such as LU decomposition.

Assuming a dense matrix, the time per iteration (measured by the number of

multiplications and divisions) is the same for the revised and full tableau methods when

n=3m²+m. The last line of Table 5.6 in Section 5.7 shows an example. If n<3m²+m

the full tableau method should be faster. If n>3m²+m the revised method should be

faster. That means that for dense problems the full tableau method requires less

computation than the revised method for n<3m²+m. This analysis is for the revised

using the explicit inverse representation. For other representations the analysis

doesn’t apply on a per iteration basis.

As just explained, the revised method pivots on a portion of what the tableau

method pivots on; m columns instead of n columns. On the other hand, it must

calculate the current pivot column and objective row from the original problem. If the

original problem has zeros the revised need not multiply those zero elements. One

cannot count on the pivot itself being shortened since the current pivot column is not

part of the original (sparse) data. Similarly the full tableau method can't take

advantage of this sparsity since it pivots on derived data instead of the original data.

Since the revised method does not do operations on zero elements, there are

potential savings for sparse problems. Let d (for density) be the average fraction of

elements that are nonzero. For example, if we assume that on average each column has 5% of

its values nonzero, then d=5%. Determining π, pricing out and calculating the

entering column will take about dm², dmn and dm², respectively. The ratio test and pivot

still require roughly the same number of operations as before: m and (m+1)²,

respectively. The total running time of a sparse problem on the revised method using

the explicit inverse is approximately 2dm² + m² + dmn + 3m + 1. Assuming an

extreme of complete sparseness, where d is 0, the running time is m² + 3m + 1. It

equals the running time of the full tableau method when n=m. That means that in

complete sparseness the revised is faster as soon as n gets larger than m, which is the

usual case. A similar discussion to this one can be found in Nash and Sofer [Nash and

Sofer, 1996 p. 115].

For example, assume m=1,000 and there is 5% density (d). The running time

of the revised method can be estimated as

50n + 100,000 + 1,000,000 + 3,000 + 1 = 50n + 1,103,001. The running time of the

tableau method is 1,000n + 2,000 + n + 1 = 1,001n + 2,001. The running times of the

revised and full tableau methods are equal when

$$n = \frac{m(m(2d+1)+1)}{m(1-d)+1}.$$

When n=1,158

the revised and tableau methods take about the same time. When n<1,158 the tableau

method is faster. Once n>1,158 the revised method is faster. Figure 3.2 is a graph that

shows for m=1,000 at what n the revised method overtakes the tableau method. This

is shown for varying density.
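The crossover values plotted in Figure 3.2 follow directly from the operation counts above; a short sketch:

```python
# Per-iteration multiplication/division counts from the text, and the
# crossover n at which the two methods are equal (explicit inverse,
# classical column choice rule).
def full_tableau_ops(m, n):
    return m * n + 2 * m + n + 1

def revised_ops(m, n, d):
    return 2 * d * m * m + m * m + d * m * n + 3 * m + 1

def crossover_n(m, d):
    return m * (m * (2 * d + 1) + 1) / (m * (1 - d) + 1)

m = 1000
for d in [0.0, 0.05, 0.1, 0.25, 0.5]:
    print(f"d = {d:4.2f}: revised overtakes tableau at n = {crossover_n(m, d):5.0f}")
# d = 0.00 -> n = 1000;  d = 0.05 -> n = 1158;  d = 0.50 -> n = 3994
```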



[Figure 3.2 plots, for m=1,000, the crossover n (vertical axis) against density d from 0 to 0.5 (horizontal axis); the plotted values rise from n=1,000 at d=0 through n=1,158 at d=0.05 up to n=3,994 at d=0.5.]

Figure 3.2-n at which the revised method overtakes the tableau method (m=1,000)

4. Motivation for a serial and distributed full tableau implementation

4.1 Why the method is applied to full tableau

Aside from the two early papers mentioned, recent research has focused on

the revised method. We use the full tableau method. There are a number of reasons

for this. The two main reasons are that it is better for dense problems (Section 4.1.4)

and that a coarse grained distributed program is straightforward (Section 4.1.3). More

specifically:

4.1.1 No pricing out or column reconstruction

As stated above the revised method requires extra computation to calculate

(price out) the objective row. It also requires extra computation to calculate the

column entering into the basis. Using the full tableau method avoids this computation.

4.1.2 Alternate column choice rules can easily be used

Another advantage of using the full tableau method is the possibility of using

multiple column choice rules. These were briefly mentioned in Section 2.2 and in

Section 3.3. The classical (Dantzig) rule simply selects the column with the maximum

coefficient in the objective row of the updated tableau. It is easy to use this rule. The

cost of deciding which column will enter the basis, using the Dantzig rule, is only n

comparisons. The revised method as mentioned in previous sections must determine

the multipliers and price out; a cost of m²+mn.

Other column choice rules, on the other hand, require the values of the

complete nonbasic column in addition to the value of the objective coefficient.

The full tableau method allows the use of these other column choice rules

without the extra computational cost of recreating those columns. Its only additional

cost is m multiplications per column. This computation is amortized over the different

processors in dpLP.

The revised, on the other hand, does not keep the updated column on hand.

(Note the m² cost of computing the entering column.) To use these other rules the

revised must recompute every column (not only the entering column) from the

inverse, instead of just pricing out the objective row. In addition to the mn cost of

pricing out it would now cost m²n to reconstruct all n columns. Goldfarb and Reid

[1977] and Forrest and Goldfarb [1992] addressed this problem for the steepest edge rule. They used a

recurrence formula to update the steepest edge direction (norm). There is a substantial

cost to initializing this formula. In addition, each iteration takes longer due to the cost

of updating this recurrence formula. They have reported that using this Steepest Edge

rule generally cuts down the total computation time for any problem size; that is, it

reduces the number of iterations enough to more than compensate for the added per-

iteration cost. A few of the rules that can be used with the full tableau method

include the greatest change method and different gradient methods. Wolfe and Cutler

[1963] and Kuhn and Quandt [1963] studied these column choice rules using the full

tableau method. They each took rather small linear programming problems and ran

them using various column choice rules. They calculated the time per iteration and

the number of iterations for each run. From these runs they concluded that on average

the greatest change method and different forms of the steepest edge method result in

fewer iterations. The extra cost per iteration, though, was too costly and made the

algorithm as a whole slower when used with these alternate rules. We did similar tests

on our serial implementation on large problems and did find a gain in computation

time for alternate rules. We compare the standard rule to the steepest edge rule in

Sections 5 and 7.

4.1.3 The inverse does not have to be kept in each processor

Our distributed method divides the columns amongst the processors. Using

the revised method requires part of the tableau to be recreated from the inverse each

iteration, which makes a parallel method difficult. At least one processor would need

a copy of the whole m-by-m inverse or its functional equivalent. The full tableau

method does not need to recreate any row or column and can simply hold as many

columns as it wants without the extra inverse overhead. Using the full tableau method

makes it unnecessary for any processor to carry the inverse of the basis.

4.1.4 Dense problems do not gain by use of the revised

Our method works for all problems, but it performs best when used on dense

problems. This is because the revised method is slower than the full tableau method

for dense problems even on one processor, assuming n is not too much greater than

m, as was noted in Section 3.5. For dense problems there is no point in using the

revised method. Applications that give rise to dense problems are given in Section

8.2.

4.2 Distributed computing

Distributed computing is the use of multiple computers communicating over a

loosely connected network. They do not share memory but communicate via message

passing [Maekawa, 1987].

4.2.1 Why use distributed over MPP – no need for a supercomputer

A distributed system has a number of advantages over a massively parallel

machine. It can be composed of readily available computers. A distributed program



can be run on a distributed system of dedicated computers. This would mimic a

supercomputer. It can also be run on any network of workstations that is not using all

its CPU resources. A special supercomputer is unnecessary.

4.2.2 Why coarse grain parallelism is used

When using a distributed network, special attention must be paid to the

tradeoff between communication time and the time spent by the processors doing

useful work. The physical communication among processors in a distributed system is

slower than among processors in a supercomputer. Thus, while relatively fine-grained

parallelism can be used with supercomputers, in a distributed system only coarse-

grained parallelism can be used effectively. The ratio of communication time to

computation time in message passing systems is much higher than that for

supercomputers. This has significant implications for our work.



5. Models and Analysis

5.1 Synchronized parallel pivots

Our general approach is to divide the n columns of the problem amongst p

processors. There is no overlap of columns. All vectors that hold information for the

non-basic variables are also divided. The x vector, which holds the values of the non-

basic variables, is therefore divided. The b vector, which holds the values of the basic

variables, is completely copied to all the processors. At each iteration, every

processor calculates the best candidate for the pivot column from among its columns,

using one of the column choice rules. For each column choice rule that we have

discussed, a numerical value is assigned to each column, and the column

with the maximum value is chosen. In parallel, each processor looks at its

columns. The best of these values determines the local column chosen. This column

is that processor’s proposal. A coordinating processor then calculates the best value

from among the proposals. In our (first) implementation, only one column is proposed

per processor although generalizations to multiple proposals are easy to design (see

Section 8.3). The processor with the winning proposal sends its column to all

processors who then pivot on their columns (explained below). This pivot is the last

step of the iteration. The iteration is repeated until optimality, infeasibility or

unboundedness is detected. Program details are left to the appendices.



Communications required for parallel programming


Our parallel programming algorithm requires the following communication

among the processors.

1. Initialization

At the beginning of processing before any iteration, each participating

processor must be sent the initial value of the basic variables, b, and a

subset of the columns of the problem (approximately n/p columns

each).

2. Column Proposal

Every iteration, each processor must make known the value of its best

column according to the column choice rule being used.

3. Pivot Column Broadcast

Every iteration, the best candidate among the proposals of the

processors is selected; the complete winning column must then be sent

to each processor.

4. Finalization

At the end of the processing, after all iterations, the information

associated with the non-basic variables at each processor must be

consolidated to provide the final solution.

Since Steps 1 and 4 are executed only once, we therefore give them less

consideration than Steps 2 and 3, which must be performed each iteration.

There are two implementations of Steps 2 and 3, which we refer to as

centralized and decentralized respectively.



In either approach, Steps 1 and 4 are both coordinated by a distinguished

processor. In the centralized approach this processor also coordinates Steps 2 and 3.

That is, each processor sends the value of its column proposal to the coordinator. The

coordinator selects the winner, and tells the winning processor to send its column to

all the other processors. In the decentralized approach, each processor broadcasts its

proposal to all the other processors. Simultaneously, all processors can determine the

winner; specifically the winning processor is able to determine that it is the winner. It

can then broadcast its column to all the processors. No coordination is required.

A broadcast communication medium such as an Ethernet or token ring local

area network can facilitate this processing. For example, the algorithm requires only

one broadcast transmission of order m doubles to accomplish step 3. (Getting

application layer communication facilities to broadcast in order m time is not easy;

see Section 6.2.). Step 2 using the centralized approach requires p point-to-point

transmissions of essentially one double to the coordinator. The coordinator then sends

a point-to-point transmission to the winning processor. In the decentralized approach,

this is replaced by p broadcasts of a double, one from each processor.
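The per-iteration communication of Steps 2 and 3 maps directly onto the Allreduce and Bcast primitives analyzed in Section 5.2.3. The following is a minimal mpi4py sketch of the decentralized scheme; it is illustrative only, since our actual implementation uses datagram sockets (see Section 6) in order to reach the Ethernet broadcast facility, which available MPI implementations do not use.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def propose_and_broadcast(my_cols, scores):
    """Steps 2 and 3 of one iteration, decentralized (sketch).

    my_cols holds this processor's ~n/p columns; scores holds the
    column-choice-rule value of each local column (larger is better).
    """
    # Step 2 (column proposal): every processor announces its local best.
    local_best = int(np.argmax(scores))
    proposals = comm.allgather((float(scores[local_best]), rank))
    value, winner = max(proposals)     # all processors agree on the winner
    if value <= 0.0:
        return None                    # no eligible column: optimal
    # Step 3 (pivot column broadcast): winner ships its column to all.
    col = my_cols[:, local_best].copy() if rank == winner else None
    return comm.bcast(col, root=winner)
```

The allgather of a single (value, rank) pair plays the role of the Allreduce, and the bcast is the single order-m broadcast of step 3.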

5.2 Short sketch of one distributed pivot step

A more complete program description is provided in the appendices.

Assume we are in Phase II:

a) Each processor selects and sends its local best column to the main

coordinating processor. The main processor calculates the global max. If no

column can be selected we are at the optimum, and Phase II is over.



b) The processor with the global max selects the variable (row) to leave the basis

and ships a copy of its winning column together with its row choice to every

processor.

c) Next, there are 3 possibilities, as in every simplex algorithm:

i) A bound of the non-basic variable corresponding to the winning column is

the first to be violated. Let the variable go to this bound.

ii) Row i's constraint is the first to be violated. We then pivot using the

violated row.

iii) No constraint is violated. The problem is unbounded.

5.2.1 Choice of four models of parallel communication

Several models of parallel computation were investigated. One is the LogP

model proposed by Culler et al [1993]. Another is the BSP model proposed by

Valiant [1990]. In addition to the previous two we looked at simple models assuming

the use of local area networks (LAN’s) such as Ethernets or Token Rings for

communication. Most LAN’s are intrinsically broadcast devices, however not all

software for using them in distributed computing takes advantage of that capability.

We considered LAN communication both with and without broadcast primitives. The

LogP model is a model for asynchronous computation whereas the BSP is a model for

synchronous computation. In choosing which model to use, we had to look at our

algorithm to see how synchronous it is. We also had to determine the size of the

computational chunks performed independently by the individual processors. The

matter of whether an algorithm uses coarse or fine grain parallelism influences the

choice of a communication model. All the models were used in the analysis, although

we modified the LogP model slightly. In our testing, detailed in Section 7, only the

Ethernet with broadcast is used. The following paragraphs explain this in detail.

As just mentioned, four parallel models were used to analyze the program.

These will be listed later in this section in Table 5.3. The first is a model of Ethernets

using a broadcast primitive. The second is a model of Ethernets not using a broadcast

primitive. These were chosen due to the common use of these topologies. The other

two models assume a completely connected topology. They are the BSP model and

another commonly used (CU) model. This commonly used model will be used instead

of the LogP model for two reasons. The LogP model is complex and it overestimates

the running times for programs that send large messages. Our program broadcasts

vectors which can be quite large and which can cause LogP to give bad estimates.

Whereas the CU model is asynchronous like LogP, BSP assumes synchronous

supersteps. Our algorithm has a few supersteps which makes BSP an interesting

model to use. By using both these models, we can see how long it should take both on

a synchronized system and on an asynchronous system.

5.2.2 Basic explanation of communication parameters

In the discussion below we are sending m double precision floating point

numbers across a network that has p processors. For illustration, we assume m to be

1,000 and p to be 100. The discussion would also apply to all m and p.

For an Ethernet we assume a simple two-parameter model of a LAN. If 1000

items are to be sent across the network within one message it takes s+1000g time. s

represents the startup time (latency) and g represents the time per item (the inverse of the throughput).

two communications can take place simultaneously. If the Ethernet supports a

broadcast primitive, sending the same 1000 items to p processors takes s+1000g; if

broadcasts are not supported then (99)(s+1000g) time is required. We ignore the

effects of collisions and retransmissions, and assume the Ethernet is lightly loaded.

The LogP model has three parameters: l, o and h. The definition of these

parameters is subtler than for those above. The parameter l is the time it takes for an

item to go from processor to processor over the network. Typically this is extremely

short. o is the operating system time taken by a processor to send or receive an item.

To send one item would cost l+2o. h is the time the processor must wait before

sending the next item. In this model, processor A can begin sending to another

processor C after sending to processor B before B has completely received its data.

Whether to send 10 items from one processor to another or to send one item from one

processor to 10 other processors (broadcast) takes l+2o+10h. Clearly, sending 10

items as a group will be faster on most architectures, which is why LogP will

overestimate the time for long transmissions.

The commonly used model (CU) has 2 parameters: s and g. These are the

same as in the Ethernet model. The difference is that different groups of processors

can communicate to each other simultaneously, whereas in the Ethernet only one

processor can successfully communicate at a time. Unlike the LogP model, in the CU

model a processor can't begin a session until its previous session is finished. To send

1000 items takes s+1000g.

The BSP model has 2 parameters: L and g. g is the same g as in the last model.

L is different from s. In the BSP, communication is synchronized. For each

synchronized period, called a superstep, one takes the maximum number of items any

one processor communicated (sent or received), multiplies it by g, and adds L. For



processor 1 to send 10 items to each of 100 processors in one superstep takes

L+1000g, since processor 1 sent a total of 100*10 items.
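To make these conventions concrete, the following sketch evaluates the running example (1,000 items, p = 100) under each model; the parameter values are placeholders, not the measured estimates of Section 5.5:

```python
# Cost of the running example under each model's conventions.
# s, g, L, l, o, h are placeholder values for illustration only.
m_items, p = 1000, 100
s, g, L = 1e-3, 1e-6, 2e-3           # Ethernet/CU/BSP-style parameters
l, o, h = 1e-5, 1e-4, 1e-6           # LogP-style parameters

ethernet_bcast    = s + m_items * g               # broadcast primitive used
ethernet_no_bcast = (p - 1) * (s + m_items * g)   # p-1 point-to-point sends
cu_session        = s + m_items * g               # one session between a pair
logp_like         = l + 2 * o + m_items * h       # per the LogP discussion
bsp_superstep     = L + m_items * g               # root sends 10 items to each
print(ethernet_bcast, ethernet_no_bcast, cu_session, logp_like, bsp_superstep)
```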

5.2.3 Analyzing basic communication operations

Our program uses Datagram sockets to take advantage of the Ethernet’s

broadcast facility. Our program has two basic communication segments that

correspond nicely to two basic parallel communication primitives, Allreduce and

Bcast, from MPI [Snir, 1996]. MPI is a parallel communications package that allows

many processors to communicate with one another. Available implementations,

unfortunately, do not take advantage of the Ethernet broadcast facility. This

motivated us to do our own parallel implementation using socket programming.

Section 6 describes MPI as well as the original reason for its choice as the parallel

package before sockets were used. For ease of reference, the two MPI primitive

names will be used for what we implement using sockets. Below is a brief analysis of

the Bcast and Allreduce MPI primitives. For a more in-depth analysis see Karp et al

[1993], who discuss the complexity of these operations.

Allreduce gathers one element from each processor to a "root" processor. (This first step is called Reduce.) The root calculates the maximum of these values and broadcasts the maximum to all the other processors. For a completely connected network topology, Reduce has been shown to take the same time as Bcast, since it involves messages between the same endpoints in the opposite direction [Karp et al, 1993]. Allreduce takes at most 2*Bcast time. This bound is easily achieved by executing Reduce followed by Bcast. Karp et al [1993] show that, optimally, Allreduce takes no longer than Reduce.
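In MPI terms the two-step bound corresponds directly to calling Reduce and then Bcast. A minimal sketch using standard MPI calls, for a maximum over one double per processor:

#include <mpi.h>

/* Allreduce realized as Reduce followed by Bcast -- the 2*Bcast bound. */
void allreduce_max_two_step(double *local, double *global, MPI_Comm comm) {
    MPI_Reduce(local, global, 1, MPI_DOUBLE, MPI_MAX, 0, comm);  /* root 0 computes the max */
    MPI_Bcast(global, 1, MPI_DOUBLE, 0, comm);                   /* root 0 sends it back out */
}

/* The library's one-call equivalent. */
void allreduce_max_direct(double *local, double *global, MPI_Comm comm) {
    MPI_Allreduce(local, global, 1, MPI_DOUBLE, MPI_MAX, comm);
}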



On an Ethernet, such as our network, Reduce is slower than Bcast because all processors can listen at once but only one can transmit. Although we assume in our expressions that Reduce takes O(p) time for p processors, note that Martel [1994] found O(log p) to be the optimal time for Reduce on an Ethernet.

The Bcast command broadcasts a vector of m elements to all other processors.

Depending on the algorithm used to implement it, it can take different amounts of

time.

Timing for Ethernet with and without using Broadcast

For an Ethernet, Bcast takes s+mg or (p-1)(s+mg) depending on whether the

Ethernet broadcast primitive is being taken advantage of or not. Allreduce takes (p-

1)(s+g) or 2(p-1)(s+g) respectively.

Timing for the BSP model

For the BSP the Allreduce takes L+2(p-1)g. (This analysis assumes that the

Reduce and Bcast that are implemented within the Allreduce are done sequentially

with no overlap. This, in fact, is optimal.) This communication block is called a

“superstep.” We show how to Bcast m items both using one superstep and using two

supersteps. Assuming L is a large cost, we want to keep the supersteps to a minimum,

which is why we won't use a logarithmic tree algorithm such as in the CU model (see

the next paragraph). In the first algorithm the root sends the m items to the p-1

processors. This takes L+(p-1)mg time. In the second algorithm, the root splits the m

items into p parts of size m/p. It then sends a different part to each of the p-1

m
processors in the first superstep. This takes L + ( p − 1) time. In the second
p

superstep, each processor sends its part to the other p-2 processors (excluding itself

and the root). The root sends its portion to p-1 processors and receives the same. This

takes L + (p-1)(m/p)g time. Adding the two supersteps together gives approximately 2L+2mg

[Goudreau et al, 1999].

Timing for the Commonly Used (CU) model

For the commonly used model (CU), Allreduce takes 2(s+g)(log2 p). This can

be seen as follows. An Allreduce can be implemented as a Reduce and then a Bcast.

A Reduce is the command that gathers information from each processor to one “root”

processor. This is the opposite of a Bcast. To broadcast, a processor sends one item to

two other processors who in turn recursively each send to another two processors.

This is a cost of (s+g)(log2 p). It has been shown that an Allreduce performed using a

Reduce followed by a Bcast is optimal to within a factor of two. We assume this

implementation of Allreduce in our model, which is presented in Tables 5.1 - 5.4.

Optimally, an Allreduce can be done as quickly as a Reduce operation [Karp et al, 1993].

Bcast of m items depends on the algorithm used. In one algorithm, processor

A sends m items to processor B. Both processors recursively send to two other

processors. This takes (s+mg)(log2 p) time. The second algorithm splits the m items

into p/2 pieces of size 2m/p. A sends piece one to B, B sends it to C and so on. As

soon as C gets that piece (B is then ready for more) A sends the next piece to B. The last piece leaves A after p-1 steps and arrives at the last processor after 2p-3 steps.

Pipeline: A → B → C → D → ...  (each arrow carries a piece of size 2m/p)

This takes (s + (2m/p)g)·(2(p-3)) = 2ps + 4mg - 6s - 12(m/p)g. This second algorithm can also be

extended from a simple pipeline to using a log2p tree, similar to the first algorithm.

More detailed explanation of the BSP and log2 p models can be found in Goudreau et

al [1999] and Culler et al [1993] respectively.

These MPI primitives are used in the steps described in Section 5.1. Allreduce

is used for step 2 “Column Proposal” and Bcast is used for step 3 “Pivot Column

Broadcast.” In our implementation we replaced the broadcast by an Ethernet multicast operation, which takes advantage of Ethernet’s multicast/broadcast facility. Multicast, as opposed to broadcast (which affects all processors on the LAN), sends the vector only to the processors running the program.
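The multicast operation itself rests on ordinary datagram-socket calls. The sketch below shows only the sending side; the group address and port are hypothetical placeholders, the socket is assumed created with socket(AF_INET, SOCK_DGRAM, 0), and all error handling is omitted. Receivers would join the group with a setsockopt IP_ADD_MEMBERSHIP call.

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define GROUP "239.0.0.1"   /* hypothetical multicast group address */
#define PORT  9000          /* hypothetical port */

/* Send one pivot column (m doubles) to every processor in the group. */
void multicast_column(int sock, const double *col, int m) {
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = inet_addr(GROUP);
    addr.sin_port = htons(PORT);
    /* One sendto reaches all group members; unlike a broadcast, hosts
       that have not joined the group are not disturbed. */
    sendto(sock, col, m * sizeof(double), 0,
           (struct sockaddr *)&addr, sizeof(addr));
}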

5.3 The Experimental Environment

5.3.1 Description of lab(s)

Parallel experiments were performed on a set of homogeneous Sun

workstations. There are 7 identical Sun Ultra 5 Workstations (270 Megahertz), each

with 128 MB RAM, all running Solaris 5.7. A single 100-megabit per second

Ethernet with a shared file system connects them all.

Workstations that were not identical were excluded. During testing it was important to make sure that there was no outside network traffic.

Experiments for the serial version, retroLP, were also performed on a PC. It is

a Dell 610MT Workstation with 384 MB RAM. It has a Pentium II processor running

at 400MHz with a 16KB L1 instruction cache; 16KB L1 data cache; 512KB

integrated L2 cache. The PC environment was used for the results given in Section

7.3.

5.4 Estimates of computation parameters ucol, urow and upiv

For the computation part, the times for division, multiplication and

comparison of a double were estimated. Measurements were made on a loop of



15,000 operations. The running time of this loop was then divided by 15,000. All

optimization was turned off in the compilation. In addition, a large array was

allocated, and each element of this array was operated on. This scheme prevents the compiler from optimizing the loop away. It also matches the way operations are performed on a doubly subscripted array (the tableau). A problem with using the estimates of division,

multiplication and comparison is that the row choice is not only divisions, and the

column choice is not only comparisons, and the pivot is not only multiplications. To

get a closer estimate, the functions for the three different parts of the simplex method

were called for a matrix of size m=1,000 by n=10,000. The three parts are column

choice, row choice and pivot. Note that in practice, many columns are not looked at in

detail within the column choice. It is only necessary to look at the sign of the cost

coefficient and the bound of the non-basic variable. In one empirical test we found

that only 35% of the columns were eligible candidates for the basis; the other

columns could be immediately eliminated. This is further discussed in Section 7.2.1

in reference to the steepest edge column choice rule where this observation makes a

substantial difference. The resulting times were then divided by m=1,000, n=10,000

and mn=10 million for row choice, column choice and pivot respectively. This gave a

"unit" time for that part of the algorithm. This unit is in a sense the amortized time of

all the multiplication and division operations, as well as any other part of the

calculation. These units are listed in the last three columns of Table 5.1. The running

time for any other size problem can be estimated by multiplying the unit row choice

time by m, the unit column choice time by n and the unit pivot time by mn. Although

the timing estimates for division, multiplication and comparison given in Table 5.1

are not used in the calculation, they are provided for reference.
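The measurement idea can be sketched as follows. This is a simplified illustration, not the exact harness we used; a real run repeats the loop and accounts for loop overhead.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define COUNT 15000

int main(void) {
    /* Operate on every element of a large array so the compiler
       cannot collapse the loop; this also mimics tableau access. */
    double *a = malloc(COUNT * sizeof(double));
    for (int i = 0; i < COUNT; i++) a[i] = 1.0 + i;

    clock_t t0 = clock();
    for (int i = 0; i < COUNT; i++) a[i] *= 1.000001;   /* multiplication probe */
    clock_t t1 = clock();

    printf("unit multiplication ~ %g seconds\n",
           ((double)(t1 - t0) / CLOCKS_PER_SEC) / COUNT);
    free(a);
    return 0;
}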

Regression is another way of estimating these coefficients. The first step is to

run the actual program on many different problem sizes using differing numbers of

processors. We then tabulate a list of actual computation times taken by the program

runs. Linear regression is then used to estimate the coefficients of our timing

expressions. This is discussed further in Section 7 where we discuss how well our

expressions predict the actual computation times.

5.5 Estimates of communication parameters s, g and L

The network startup time s (latency) can be estimated by sending a 0-length packet many times; in our experiments, about 10,000 times. We sent the packet from A to B and then from B back to A [Dongarra and Dunigan, 1996]. This is called ping-pong. We then took the total elapsed time and divided it by 20,000 to get the per-communication (send/receive) latency. For g, which measures bandwidth, we sent a double vector of length 8,000 (64 Kbytes). We then took the total elapsed time and divided it by 8,000. We ignored the startup time since it should be small relative to the transmission of a large packet. One can estimate L by L = 2s(log2 p).
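A minimal MPI version of the ping-pong measurement might look like the following sketch (our actual measurements used the scheme described above over the lab network):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, reps = 10000;
    char payload = 0;                        /* near-zero-length message */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {                     /* A: send, then await the echo */
            MPI_Send(&payload, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&payload, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {              /* B: echo everything back */
            MPI_Recv(&payload, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&payload, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)                           /* 2*reps sends/receives in total */
        printf("s ~ %g seconds\n", (MPI_Wtime() - t0) / (2.0 * reps));

    MPI_Finalize();
    return 0;
}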

We estimated these parameters experimentally for the network of Sun workstations described in Section 5.3. s and g were calculated for use in the communication part of the running time.

As with the computation parameters, s and g can also be estimated through regression. In this case regression yielded better results than direct experimentation. This is explained more fully in Section 7.



Explanation of Table 5.1 and Table 5.2

Each row of Table 5.1 corresponds to one set of estimates; the estimation was repeated a number of times, and the average is in the bottom row. Referring to the bottom row of Table 5.1: s = 2.10*10^-3 seconds, g = 1.76*10^-6 seconds, division = 1.32*10^-7 seconds, multiplication = 7.87*10^-8 seconds and comparison = 3.81*10^-9 seconds. The unit row choice is 1.65*10^-6 seconds, the unit pivot is 1.24*10^-7 seconds and the unit column choice is 3.73*10^-8 seconds. L did not have to be estimated separately since it is a function of s and p.

As explained in Section 7, there are only two significant coefficients. One is

upiv, which is the coefficient for the pivot step. The second is ucol_se, which is the

coefficient for the column choice rule when the steepest edge column choice rule is

being used. The final coefficient values upiv and ucol_se, used in the formulas for the

6 different communication models, are listed in Table 5.2. These 6 models are listed

in Table 5.3 and are explained in the next section.



s          g          division   multiplication   comparison   urow       upiv       ucol       ucol_se
1.76E-03   1.23E-06   1.27E-07   1.09E-07         3.86E-09     1.66E-06   1.24E-07   3.73E-08   3.41E-07
1.74E-03   1.82E-06   1.35E-07   7.41E-08         3.80E-09     1.65E-06   1.25E-07   3.74E-08   3.43E-07
2.46E-03   1.88E-06   1.31E-07   7.38E-08         3.80E-09     1.63E-06   1.26E-07   3.72E-08   3.43E-07
2.35E-03   1.81E-06   1.38E-07   7.39E-08         3.80E-09     1.66E-06   1.24E-07   3.72E-08   3.42E-07
2.05E-03   1.85E-06   1.31E-07   7.77E-08         3.80E-09     1.71E-06   1.24E-07   3.71E-08   3.41E-07
2.64E-03   1.84E-06   1.32E-07   7.37E-08         3.80E-09     1.64E-06   1.24E-07   3.77E-08   3.43E-07
1.94E-03   1.82E-06   1.32E-07   7.37E-08         3.80E-09     1.64E-06   1.26E-07   3.72E-08   3.43E-07
1.88E-03   1.84E-06   1.33E-07   7.37E-08         3.80E-09     1.63E-06   1.24E-07   3.71E-08   3.42E-07
2.10E-03   1.76E-06   1.32E-07   7.87E-08         3.81E-09     1.65E-06   1.24E-07   3.73E-08   3.42E-07   (average)

Table 5.1 - Coefficient estimates (all times in seconds). urow = unit row choice, upiv = unit pivot,
ucol = unit column choice, ucol_se = unit column choice with steepest edge; the bottom row is the
average of the rows above.

upiv 1.24E-07
ucol_se 3.42E-07

Table 5.2 – Coefficient estimates used in models



activity                  eth-broad           eth-broad St. edge  eth-nobroad         CU Alg 1            CU Alg 2            BSP Alg 1           BSP Alg 2
comp. get local max       (n/p)ucol           (mn/p)(se_ucol)     (n/p)ucol           (n/p)ucol           (n/p)ucol           (n/p)ucol           (n/p)ucol
comm. Allreduce           (p-1)(s+2g)         p(s+2g)             2(p-1)(s+g)         (s+g)(log2 p)       (s+g)(log2 p)       L+2(p-1)g           L+2(p-1)g
comp. winner calcs.
      pivot row           (m)urow             (m)urow             (m)urow             (m)urow             (m)urow             (m)urow             (m)urow
comm. Bcast column + int  s+(m+1)g            s+(m+1)g            (p-1)(s+(m+1)g)     (s+mg)(log2 p)      2ps+4mg             L+(p-1)mg           ~2L+2mg
comp. do pivot            ((n+1)(m+1)/p)upiv  ((n+1)(m+1)/p)upiv  ((n+1)(m+1)/p)upiv  ((n+1)(m+1)/p)upiv  ((n+1)(m+1)/p)upiv  ((n+1)(m+1)/p)upiv  ((n+1)(m+1)/p)upiv

(The CU and BSP columns assume a fully connected network.)

Table 5.3 - Expressions of the six models

                     n = 100               n = 1,000             n = 10,000            n = 50,000             n = 100,000
                     p*     time           p*     time           p*     time           p*      time           p*      time
eth-broad            8.1    0.005832156    25.3   0.012539132    80.1   0.033778564    179.1   0.072178696    253.3   0.1009531
eth-broad St. edge   15.5   0.008733838    49.1   0.021736331    155.1  0.062869419    346.9   0.137229171    490.6   0.192948601
eth-nobroad          2.5    0.009358383    7.7    0.031334104    24.5   0.100926097    54.7    0.226745932    77.3    0.321026613
CU alg 1             4.6    0.005184726    45.8   0.007063647    457.3  0.008949164    2286.5  0.010267533    4572.9  0.010835344
CU alg 2             26.4   0.022746269    31.3   0.029126954    62.2   0.063022539    129.4   0.129994516    181.4   0.180583468
BSP alg 1            13.3   0.023429024    15.8   0.034337139    31.3   0.090299238    65.3    0.19768193     91.6    0.278111702
BSP alg 2            64.7   0.011540454    203.8  0.014284651    644.1  0.018843167    1440.2  0.024959694    2036.7  0.029115974

Table 5.4 – p* and optimal time per iteration (seconds) for m = 1,000



5.6 The performance model

The program as explained in Section 3 is an iterative method. It therefore

depends on the number of iterations. Our method parallelizes each iteration.

Assuming we use the same column choice rule whether we use the full tableau

method or the revised method, the number of iterations should be the same. Our

analysis therefore focuses on the timing within an iteration. To get the total running

time the number of iterations can then be multiplied by the value calculated. We are

careful to only include the time spent solving the problem; e.g., the time taken to read

input or report output is not included.

Explanation of Table 5.3

Section 5.2 gave a short sketch of one distributed pivot step. Step 1 in Section

5.2 consists of computation within each processor of its local maximum followed by

communication of the maximums to the coordinating processor. The timing for these

actions is given in the first two rows of Table 5.3. Step 2 in Section 5.2 consists of

computation within the “winning” processor of the leaving basic variable (row

choice) and the communication via broadcast of both its pivot column together with

its row choice. The timing for these actions is given in rows 3 and 4 of Table 5.3.

Finally the timing of the pivot within Step 3 of Section 5.2 is given in the last row of

Table 5.3. Note that this analysis assumes that pivot steps will be performed every

iteration.

urow, ucol, upiv, s and g are constants as in the previous section. se_ucol is

the constant for the steepest edge column choice rule. This constant is close to upiv in

value. In rows 1, 3 and 5 each processor goes through its n/p columns, m rows and ((n+1)/p)(m+1) matrix elements respectively. Computation calculations similar to those in rows 1, 3 and 5 of Table 5.3 can be found in Nash and Sofer [1996, pgs. 114-116].

The communication rows (2 and 4) of Table 5.3 were explained in Section 5.2.3.

Each column of the table corresponds to a model. To get the complete running

time of the program for a given model, simply sum that column and multiply by the

number of iterations. This assumes the classical column choice rule was used with a

pivot in each iteration within Step 3 of the steps listed in Section 5.2. The second

column is the only column that assumes the steepest edge rule. The only change from

the first column is in the first row.

5.7 The optimum number of processors

For our distributed algorithm, there is a tradeoff between communication time

and computation time. As the number of processors increases, the computation time

decreases, but the communication time increases. For each communication model, we

can estimate the optimal number of processors to use by taking the derivative of the

timing function of the algorithm with respect to p. This is then set to 0 and solved for

p. The timing functions correspond to the columns in Table 5.3.

As an example consider the Ethernet model without broadcast. Using the

given formulas with our estimated values of s, g, ucol, urow, upiv with m = 100 and n

= 50,000 results in an optimal number of processors p = 16.3. In practice this number

would have to be rounded.



    f(p) = (n/p)ucol + 2(p-1)(s+g) + m·urow + (p-1)(s+(m+1)g) + ((n+1)(m+1)/p)upiv

    df/dp = -(n·ucol + (n+1)(m+1)·upiv)/p² + 3s + 3g + mg = 0

Table 5.4 gives the predicted optimal number of processors for m = 1,000

where n varies from 100 to 100,000.

The following expressions generalize this example. The expressions drop the

units (m instead of m+1). γ, ρ, π and Γ are ucol, urow, upiv and se_ucol

respectively. p* is the optimum p. Three expressions are given. The first corresponds

to the Dantzig rule assuming Ethernet with broadcast as in our example above. The

second corresponds to the Steepest Edge rule again assuming Ethernet with broadcast.

The third expression corresponds to the Dantzig rule but the communication model is

Ethernet without the broadcast facility.

Dantzig Rule:
    T = γn/p + ρm + πmn/p + s + gm + (s+g)p
    p* = sqrt((γn + πmn)/(s+g))

Steepest Edge:
    T = Γmn/p + ρm + πmn/p + s + gm + (s+g)p
    p* = sqrt(mn(Γ+π)/(s+g))

Linear Broadcast:
    T = γn/p + ρm + πmn/p + (s+gm)p + (s+g)p
    p* = sqrt((γn + πmn)/(2s + g(m+1)))
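These closed forms translate directly into code. The sketch below simply transcribes the three expressions; gam, pi_ and Gam stand for γ, π and Γ, and the results depend entirely on the coefficient estimates supplied.

#include <math.h>

double pstar_dantzig(double gam, double pi_, double s, double g, double m, double n) {
    return sqrt((gam * n + pi_ * m * n) / (s + g));
}

double pstar_steepest_edge(double Gam, double pi_, double s, double g, double m, double n) {
    return sqrt(m * n * (Gam + pi_) / (s + g));
}

double pstar_linear_broadcast(double gam, double pi_, double s, double g, double m, double n) {
    return sqrt((gam * n + pi_ * m * n) / (2.0 * s + g * (m + 1.0)));
}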

5.8 The optimum number of processors with a new column division scheme

Classic rule

The optimum p* derived in the previous section assumed that columns were

equally divided among the processors. After the processors each calculated their best

choice of entering column the processors did a communication step. This

communication took a total of ps time; one proposal for each processor. If we can

give an unequal number of columns to each processor, some of the communication

would be simultaneous with computation.

We would like to divide the n columns so that the processors receive

    n0, n0 + k, n0 + 2k, ..., n0 + (p-1)k columns  ⇒

    n = n0·p + k·p(p-1)/2  ⇒

    n0 = n/p - k(p-1)/2

Where n0 is the number of columns every processor has as a base and k is the

additional columns a processor gets over the previous processor. k should be the

number of columns whose computation takes the time of one send.

    s + g = (πm + γ)k  ⇒  k = (s+g)/(πm + γ)

Notice that now, instead of a cost of ps, there is only the cost of the s of the last processor. The new timing function and p* are

    T = γ(n0 + (p-1)k) + ρm + πm(n0 + (p-1)k) + s + gm + s + g

    p* = sqrt(2n/k) = sqrt(2)·sqrt((γn + πmn)/(s+g))

This scheme’s p* is larger than the old scheme’s p* by a factor of the square root of 2.

As an example take a problem where m=1,000 and n=5,000. The first scheme

calculates

    p* = 53.53
    n0 = n/p = 93.40
    k = 0
    T = .02347 seconds per iteration

The new scheme calculates

p* = 75.95
n0 = .866
k = 1.73
T = .01761 seconds per iteration

This is a significant savings in time.

All of the calculations and experiments in this paper used the first scheme.

The last scheme is mentioned in the future work section (Section 8).
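For reference, the second scheme's quantities can be computed as in the following sketch, a direct transcription of the formulas above (gam = ucol and pi_ = upiv; the numbers produced depend entirely on the coefficient estimates supplied):

#include <math.h>

typedef struct { double k, n0, pstar; } Scheme2;

Scheme2 scheme2(double gam, double pi_, double s, double g, double m, double n) {
    Scheme2 r;
    r.k     = (s + g) / (pi_ * m + gam);            /* columns worth one send */
    r.pstar = sqrt(2.0 * n / r.k);                  /* sqrt(2) times the old p* */
    r.n0    = n / r.pstar - r.k * (r.pstar - 1.0) / 2.0;
    return r;
}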

Steepest edge rule

A similar analysis is provided for the steepest edge column choice rule.

    s + g = (πm + Γm)k  ⇒  k = (s+g)/(πm + Γm)

    T = Γm(n0 + (p-1)k) + ρm + πm(n0 + (p-1)k) + s + gm + s + g

    p* = sqrt(2n/k) = sqrt(2)·sqrt((Γmn + πmn)/(s+g))

This scheme’s p* is also larger than the old scheme’s p* by a factor of the square root of 2.

Figure 5.1 shows the time per iteration as n increases. Both schemes are

compared using both column choice rules.

[Figure: time per iteration (vertical axis) versus n (horizontal axis); series: Par. Bcast,
Bcast scheme 2, Par. Bcast S.E., Bcast SE sch 2]

Figure 5.1 - Time per iteration as n increases

5.9 Estimated parallel running time for each communication model

Table 5.5 consists of the estimated timing for all the communication methods.

We use the results of Table 5.4 as the number of processors (p) to plug into the

expressions in Table 5.3. In Table 5.5, m is set at 1,000. It is now easy to estimate the

ratio of computation to communication. For example in the Ethernet-broadcast model

with n=50,000, computation time is a total of .0359 seconds per iteration.

Communication takes .0362 seconds per iteration.



n 100 1000 10,000 50,000 100,000


eth-broad col choice comp. 3.11E-10 9.86E-10 3.12E-09 6.98E-09 9.87E-09
comm. 1.57E-03 4.72E-03 1.53E-02 3.45E-02 4.89E-02
row choice comp. 1.21E-03 1.21E-03 1.21E-03 1.21E-03 1.21E-03
comm. 1.69E-03 1.69E-03 1.69E-03 1.69E-03 1.69E-03
pivot comp. 1.56E-03 4.91E-03 1.55E-02 3.47E-02 4.91E-02
comp: 2.77E-03 6.12E-03 1.67E-02 3.59E-02 5.03E-02
comm: 3.27E-03 6.41E-03 1.70E-02 3.62E-02 5.06E-02
total: 6.04E-03 1.25E-02 3.38E-02 7.22E-02 1.01E-01
eth-broad col choice comp. 2.20E-03 6.97E-03 2.21E-02 4.93E-02 6.98E-02
St. Edge comm. 3.04E-03 9.59E-03 3.03E-02 6.78E-02 9.59E-02
row choice comp. 1.21E-03 1.21E-03 1.21E-03 1.21E-03 1.21E-03
comm. 1.69E-03 1.69E-03 1.69E-03 1.69E-03 1.69E-03
pivot comp. 8.09E-04 2.54E-03 8.02E-03 1.79E-02 2.54E-02
comp: 4.22E-03 1.07E-02 3.13E-02 6.85E-02 9.63E-02
comm: 4.73E-03 1.13E-02 3.20E-02 6.95E-02 9.75E-02
total: 8.95E-03 2.20E-02 6.33E-02 1.38E-01 1.94E-01
eth-nobroad col choice comp. 1.02E-09 3.23E-09 1.02E-08 2.29E-08 3.23E-08
comm. 5.65E-04 2.61E-03 9.09E-03 2.08E-02 2.96E-02
row choice comp. 1.21E-03 1.21E-03 1.21E-03 1.21E-03 1.21E-03
comm. 2.47E-03 1.14E-02 3.97E-02 9.09E-02 1.29E-01
pivot comp. 5.12E-03 1.61E-02 5.09E-02 1.14E-01 1.61E-01
comp: 6.33E-03 1.73E-02 5.21E-02 1.15E-01 1.62E-01
comm: 3.03E-03 1.40E-02 4.88E-02 1.12E-01 1.59E-01
total: 9.36E-03 3.13E-02 1.01E-01 2.27E-01 3.21E-01
CU (logP) col choice comp. 5.41E-10 5.46E-10 5.47E-10 5.47E-10 5.47E-10
alg 1 comm. 4.28E-04 1.07E-03 1.71E-03 2.16E-03 2.36E-03
row choice comp. 1.21E-03 1.21E-03 1.21E-03 1.21E-03 1.21E-03
comm. 3.74E-03 9.34E-03 1.50E-02 1.89E-02 2.06E-02
pivot comp. 2.72E-03 2.72E-03 2.72E-03 2.72E-03 2.72E-03
comp: 3.93E-03 3.93E-03 3.93E-03 3.93E-03 3.93E-03
comm: 4.16E-03 1.04E-02 1.67E-02 2.10E-02 2.29E-02
total: 8.10E-03 1.43E-02 2.06E-02 2.50E-02 2.69E-02
CU (logP) col choice comp. 9.48E-11 7.98E-10 4.02E-09 9.66E-09 1.38E-08
alg 2 comm. 9.15E-04 9.64E-04 1.16E-03 1.36E-03 1.45E-03
row choice comp. 1.21E-03 1.21E-03 1.21E-03 1.21E-03 1.21E-03
comm. 1.61E-02 1.81E-02 2.99E-02 5.58E-02 7.58E-02
pivot comp. 4.77E-04 3.97E-03 2.00E-02 4.81E-02 6.86E-02
comp: 1.69E-03 5.18E-03 2.12E-02 4.93E-02 6.98E-02
comm: 1.71E-02 1.90E-02 3.11E-02 5.72E-02 7.72E-02
total: 1.87E-02 2.42E-02 5.23E-02 1.06E-01 1.47E-01
BSP alg1 col choice comp. 1.89E-10 1.59E-09 7.98E-09 1.91E-08 2.73E-08
comm. 1.47E-03 1.58E-03 2.00E-03 2.51E-03 2.78E-03
row choice comp. 1.21E-03 1.21E-03 1.21E-03 1.21E-03 1.21E-03
comm. 1.98E-02 2.37E-02 4.74E-02 9.88E-02 1.38E-01
pivot comp. 9.48E-04 7.90E-03 3.97E-02 9.52E-02 1.36E-01
comp: 2.16E-03 9.11E-03 4.09E-02 9.64E-02 1.37E-01
comm: 2.13E-02 2.52E-02 4.94E-02 1.01E-01 1.41E-01
total: 2.34E-02 3.44E-02 9.03E-02 1.98E-01 2.78E-01
BSP alg2 col choice comp. 3.86E-11 1.23E-10 3.88E-10 8.68E-10 1.23E-09
comm. 2.51E-03 3.56E-03 5.52E-03 8.35E-03 1.03E-02
row choice comp. 1.21E-03 1.21E-03 1.21E-03 1.21E-03 1.21E-03
comm. 7.63E-03 8.90E-03 1.02E-02 1.11E-02 1.15E-02
pivot comp. 1.94E-04 6.11E-04 1.93E-03 4.32E-03 6.11E-03
comp: 1.40E-03 1.82E-03 3.14E-03 5.53E-03 7.32E-03
comm: 1.01E-02 1.25E-02 1.57E-02 1.94E-02 2.18E-02
total: 1.15E-02 1.43E-02 1.88E-02 2.50E-02 2.91E-02
Table 5.5 – Time per iteration for m=1,000

5.10 Running time estimates of the revised (MINOS), serial (retroLP) and

parallel Full-tableau algorithms (dpLP)

Table 5.6 is based on the models of Sections 3.4 and 5.5-5.8 and uses the

coefficients given in Table 5.2. Table 5.6 compares the estimated running time per

iteration of three algorithms for problems of varying size. The three algorithms we

compared are our serial full-tableau simplex method, the revised method and our

parallel full-tableau simplex method. The parallel simplex in the table assumes an

Ethernet with broadcast and the optimum number of processors. Two sets of optimal

number of processors are shown corresponding to the simple scheme of equally

dividing up the columns amongst the processors and the scheme proposed in Section

5.8. Both the serial time and the parallel time are also shown when the steepest edge

column choice rule is used. Times per iteration for the revised method were

estimated, assuming both a dense (100%) tableau and a sparse (5%) tableau. Both

densities were not shown for the full tableau method because density has very little

effect on the running time of the full-tableau algorithms whereas the revised method

runs more quickly on sparse problems. The last two columns of Table 5.6 show the

optimum number of processors to use when employing the standard and steepest edge

column choice rules respectively.

The optimum number of processors p would in practice be an integer even

though mathematically it can be a fraction. The timing of the algorithm is not

sensitive to this approximation (see Section 5.14). Note that the revised method, for

completely dense problems, is slower than the full tableau for all n in the table (aside

from the last line). As n rises, the revised method catches up at a very slow rate. The

revised method catches up when n equals 2m²+m, as calculated in Section 3.3. This

is shown on the last line.


48

m n serial serial S.E. Par. Bcast Bcast scheme 2 Par. Bcast S.E. Bcast SE sch 2 revised dense revised sparse p*-standard p*-St.Edge
5,000 4,500 2.80 10.50 0.0601 0.0468 0.1038 0.0776 12.13 3.56 120.13 232.68
5,000 5,000 3.12 11.67 0.0627 0.0486 0.1087 0.0811 12.44 3.58 126.63 245.26
5,000 10,000 6.22 23.33 0.0830 0.0629 0.1481 0.1089 15.55 3.73 179.07 346.85
5,000 25,000 15.55 58.32 0.1233 0.0915 0.2262 0.1642 24.87 4.20 283.13 548.41
5,000 50,000 31.09 116.64 0.1688 0.1236 0.3143 0.2265 40.41 4.97 400.40 775.56
5,000 75,000 46.63 174.95 0.2037 0.1483 0.3819 0.2743 55.95 5.75 490.39 949.87
5,000 100,000 62.18 233.27 0.2331 0.1691 0.4389 0.3146 71.49 6.53 566.25 1,096.81

10,000 9,000 11.20 42.00 0.1203 0.0933 0.2076 0.1550 48.50 14.24 240.24 465.34
10,000 10,000 12.45 46.66 0.1253 0.0968 0.2173 0.1619 49.74 14.30 253.23 490.51
10,000 20,000 24.88 93.31 0.1660 0.1256 0.2961 0.2176 62.17 14.92 358.12 693.68
10,000 50,000 62.18 233.27 0.2467 0.1826 0.4524 0.3281 99.47 16.79 566.22 1,096.80
10,000 100,000 124.34 466.52 0.3376 0.2470 0.6286 0.4527 161.63 19.89 800.76 1,551.10
10,000 150,000 186.51 699.77 0.4074 0.2963 0.7638 0.5483 223.79 23.00 980.72 1,899.71
10,000 200,000 248.67 933.02 0.4663 0.3379 0.8778 0.6289 285.94 26.11 1,132.44 2,193.59
10,000 500,000 621.66 2,332.54 0.7215 0.5184 1.3721 0.9785 658.89 44.76 1,790.54 3,468.37
10,000 1,000,000 1,243.31 4,665.07 1.0091 0.7217 1.9293 1.3724 1,280.48 75.84 2,532.21 4,905.02

10,000 300,010,000 373,000.75 1,399,562.96 17.0359 12.0538 32.9740 23.3240 373,000.75 18,661.86 43,859.85 84,958.82
Table 5.6- estimated running time per iteration

[Figure: time per iteration in seconds (vertical axis) versus n (horizontal axis, 4,500 to
100,000); series: serial, Par. Bcast, revised dense, revised sparse]

Figure 5.2 – Time per iteration (m=5,000)



In Figure 5.2, the 3 algorithms are compared. The figure corresponds to the

top part of Table 5.6 where m is 5,000. The x-axis is n and the y-axis is the estimated

time per iteration in seconds. This comparison is for dense problems. The revised method is even slower than the standard simplex. This is due to the extra computation needed to

calculate the objective row and the pivot column. As just mentioned, the figure shows

the revised method slowly catching up to the full tableau method as n increases. It

takes such a large column size to catch up that it seems reasonable to say that for all

practical dense problems the revised method is slower than the standard method.

Moving over to the Ethernet based parallel algorithm one can see that the added time

for a higher n is minimal. It is only the cost of sending a larger vector over the

Ethernet, which is a very small cost for an Ethernet with a broadcast facility.

There are four variables to deal with when comparing the different methods.

They are Aspect Ratio (AR=n/m), size (m), density (d), and number of processors (p).

Figures 5.3, 5.4 and 5.5, compare the serial full tableau method using the

standard column choice rule, the serial full tableau method using the steepest edge

column choice rule and the revised method. They compare them as density, aspect

ratio and m are varied, respectively. The base problem is m=100, p=1 (serial method),

density=5% and AR=10 (which implies that n=1,000).

Figure 5.6 is the fourth graph of the group. It shows a comparison between the

parallel full tableau method using the standard column choice rule, the parallel full

tableau method using the steepest edge column choice rule and the revised method.

The parallel method uses an Ethernet with broadcast. We also include the Ethernet

without use of broadcast.



Figure 5.7 is the same as Figure 5.6 except that it is for a larger problem

where m=10,000 and n=100,000. We graphed this in order to show a problem for

which the parallel method using a small number of processors would actually

overtake the revised method.

In both Figure 5.6 and Figure 5.7 the point on each curve that is the lowest

time per iteration is where the optimum p (p*) is being used. The p* values for the

parallel methods graphed in Figure 5.7 are:

eth-broad 75.7

eth-broad St. edge 143.5

eth-nobroad 20.1

Notice that these numbers can be rounded to whole integers. From the graphs

one can see that the time per iteration for Ethernet using broadcast is not very

sensitive to p in the range of p*/2 to p* (see Section 5.7).

[Figure: time per iteration versus density (non-zero coefficients), 0% to 100%;
series: revised, serial, Serial S.E.]

Figure 5.3 - Time per iteration as a function of density



[Figure: time per iteration versus aspect ratio, 0.5 to 40; series: revised, serial, Serial S.E.]

Figure 5.4 - Time per iteration as a function of aspect ratio

[Figure: time per iteration versus number of rows m, 50 to 1,000; series: revised, serial,
Serial S.E.]

Figure 5.5 - Time per iteration as a function of m



[Figure: time per iteration versus number of processors p, 1 to 45; series: revised,
Eth-Bcast, Eth-NoBcast, Eth St.edge]

Figure 5.6 – Time per iteration as a function of p

[Figure: time per iteration versus number of processors p, 1 to about 155; series: revised,
Eth-Bcast, Eth-NoBcast, Eth St.edge]

Figure 5.7 – Time per iteration as a function of p

Analysis of figures 5.3 – 5.7

For sparse problems the revised is much quicker than the serial full tableau

method. Our distributed method when used with the optimal number of processors on

an Ethernet with broadcast is even faster than the revised method. We thus have two

important parameters that determine whether the revised method or our full tableau

(retroLP and dpLP) algorithms are faster. One parameter is density and the other is

the number of processors that we have. If we have a completely dense problem our

method is faster. If we have the optimum number of processors our method is often

faster even on sparse problems. This can be seen in Figures 5.6 and 5.7. The figures

refer to problems with only 5% density. The parallel algorithm on an Ethernet with

broadcast comes extremely close to the time of the revised, for a problem size of 100

x 1,000 (figure 5.6) for its optimum 7.6 processors. For a problem size of 1,000 X

10,000 the parallel method would overtake the revised if it had 7 processors, much

lower than its optimum. This is shown in Figure 5.7. The optimum p for that example

is about 76 and 144 processors for the standard and steepest edge rules respectively.

Even in our first example a slightly higher density would already cause the revised to

be slower. In practice our method can therefore be used in two cases. One case:

problems that have a significant number of nonzero coefficients. Examples of such

applications are given in Section 7. The second case occurs whenever we have access

to many processors – even for problems that are not dense.

5.11 Advantage of the Steepest Edge column choice rule

These comparisons assume all methods are using the classical Dantzig column

choice rule. The tableau method and especially the distributed method can make very

effective use of alternative column choice rules to further improve the relative

performance. One of these rules is the steepest edge rule. As mentioned in Section

4.1.2 it is not necessary to re-compute any columns in order to perform this rule. It

simply requires, at most, an extra mn multiplications within the column choice step.

The steepest edge rule is included in these tables and in Figures 5.6 and 5.7. Notice

how expensive it is to use with only one processor and how quickly it speeds up as

the processor number increases. Here we only compare time per iteration. The gain of

the alternate column choice rules is in the fewer number of iterations that are

necessary. This extra computation is more than offset by the reduction in the number

of iterations [Harris 1973; Goldfarb and Reid 1977; Forrest and Goldfarb, 1992]. See

Section 7.4 for experimental results supporting this view. This extra computation is

shared amongst the processors just like the rest of the computation. These rules can

be faster than the classical rule even without the use of multiple processors. This

effect is compounded when using multiple processors.

Note that the expressions and all the graphs assume that every column is

looked at using the steepest edge column choice rule. This in fact is not true. In one

empirical test we found, on average, that only 35% of the columns were eligible and

therefore looked at. All the charts and graphs assume all columns were eligible. This

is an upper bound. The steepest edge does even better in practice than that which is

depicted in the graphs. This is also discussed in Section 7.

An expression for time per iteration using the steepest edge rule was provided

in Section 5.8. As the number of processors increases, the iteration time of the

steepest edge rule becomes more competitive with the iteration time of the standard

rule. It cannot actually catch up to the speed of the standard rule per iteration. We

would like to calculate the fraction of iterations at which the steepest edge overtakes the standard column choice rule. This break-even percentage is simply the time per iteration of the standard rule divided by the time per iteration of the steepest edge rule.

For a problem of size 1,000 by 10,000 (Figure 5.7) the percentage when p=1 is

27.89%. This means that the steepest edge must iterate 27.89% of the iterations

performed by the standard method in order for it to catch up in speed. The higher the

percentage, the better it is for the steepest edge method. For 76 processors (the

optimum for the standard method) the percentage is 46.04%. For 144 processors (the

optimum for the steepest edge) the percentage is 65.92%. Section 7.3 shows how

many iterations were actually performed by the steepest edge method in practice.
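The break-even percentage is just the ratio of the two per-iteration times from Section 5.7; a sketch (gam, Gam, rho and pi_ stand for γ, Γ, ρ and π):

/* Percentage of the standard rule's iterations at which steepest edge
   breaks even: T_standard / T_steepest_edge per iteration, times 100. */
double breakeven_percent(double gam, double Gam, double rho, double pi_,
                         double s, double g, double m, double n, double p) {
    double t_std = gam * n / p + rho * m + pi_ * m * n / p + s + g * m + (s + g) * p;
    double t_se  = Gam * m * n / p + rho * m + pi_ * m * n / p + s + g * m + (s + g) * p;
    return 100.0 * t_std / t_se;
}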

5.12 Memory requirements of revised and full tableau method

The full tableau holds a data matrix of m+1 by n+1 double precision floating-

point numbers. It also has a few auxiliary vectors, which are not included in this

calculation. (6 vectors of size m and 6 vectors of size n.) A double precision element

is 8 bytes. That amounts to 8(m+1)(n+1) bytes. A problem with table size 100x100

(m=100, n=100) takes 81,608 bytes and a problem size of 1,000 X 10,000 takes

80,088,008 bytes. Our parallel implementation would cut memory requirements on each processor to approximately memory/p, where p is the number of processors and

memory is the memory required when using only one processor. The revised method

requires (m+1)(n+1) elements for the original data. In addition it requires a (m+1) by

(m+1) matrix for the inverse of the basis assuming an explicit basis inverse is

maintained. It also has an extra vector of size m+1 for the pivot column, which is not

included in the calculation. This equals (m+1)(n+1)+(m+1)(m+1). A double precision



element takes up 8 bytes. That amounts to 8[(m+1)(n+1) + (m+1)(m+1)] bytes. A problem with table size 100x100 takes 163,216 bytes and a problem size of 1,000 x 10,000 takes 88,104,016 bytes. Figure

5.8 is a graph of memory requirements in megabytes for both the tableau method and

the revised method as m gets larger and assuming n=10m. Notice that for dense

problems with this aspect ratio, the revised method takes approximately 10% more

space than the full tableau method.

    [(m+1)(n+1) + (m+1)(m+1)] / [(m+1)(n+1)] = 1 + (m+1)/(n+1)
                                             = 1 + (1 + 1/m)/(n/m + 1/m)
                                             = 1 + (1 + 1/m)/(10 + 1/m)    for n = 10m

This assumed a direct inverse representation of the basis for the revised

simplex method. Functional equivalents would take a similar amount of storage in the

dense case. For sparse problems the revised method saves a lot of space by using

sparse matrix techniques.
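The two storage formulas are trivially coded; a sketch (bytes, with 8-byte doubles, and the auxiliary vectors ignored as in the text):

double mem_full_tableau(double m, double n) {
    return 8.0 * (m + 1.0) * (n + 1.0);                 /* the (m+1) x (n+1) tableau */
}

double mem_revised(double m, double n) {
    /* original data plus an explicit (m+1) x (m+1) basis inverse */
    return 8.0 * ((m + 1.0) * (n + 1.0) + (m + 1.0) * (m + 1.0));
}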



[Figure: memory in megabytes versus m (with n = 10m); series: full tableau, revised]

Figure 5.8 – Memory (in Megabytes)

5.13 Asymptotic (computation/communication ratio change) analysis

The coefficients used here were based on the network described in Section

5.3. It is interesting to see how it would fare on more current networks and on

networks in the future where the computation to communication time ratios may

change.

We will assume that the two communication parameters s and g change together. On our network s = 3*10^-4 and g = 1.7*10^-6, where g corresponds to sending one double (for a byte it would be .425*10^-6). The computation operations on a double floating point number are of order 10^-7. These parameters include the unit column choice for steepest edge, the unit pivot, multiplication and division. For simplicity all computational operations in this analysis will be assumed to take 10^-7 even though there are slight variations in practice.



Figure 5.9 and Figure 5.10 show the asymptotic change in the speedup of our

parallel program, when exactly 7 processors are used, as the ratio of the computation

time to communication time changes. The leftmost points in both figures show the

current speedup assuming the current speeds of s, g and computation. Each point

along the horizontal axis assumes that the communication or computation speed, for

Figures 5.9 and 5.10 respectively, doubles from the speed of the point to its left.

Figure 5.9 shows what happens when both s and g get faster and the computation

speed is held constant. On each subsequent point the speed of both s and g are

doubled. Figure 5.10, on the other hand, shows what happens when the computation

speed gets faster and both s and g are held constant. On each subsequent point the

computation speed is doubled. We assume a problem size of 1,000 by 10,000 running

on an Ethernet with broadcast and where the optimum p, p*, is used in dpLP.

From the figures we can see that the speedup is affected very much by

changes in computation and communication parameters. As s and g get faster we can

take advantage of more processors since the communication costs are lower. On the

other hand, as computation speeds increase, the relative communication cost

increases. This causes a decrease in speedup since we can use fewer processors.

[Figure: speedup (0 to 350) versus computation time/communication time ratio, scaled by
successive doublings of communication speed (1, 2, 4, ..., 128)]

Figure 5.9 – Asymptotic speedup when a unit computation = 10^-7; s and g move together

[Figure: speedup (0 to 30) versus computation time/communication time ratio, scaled by
successive doublings of computation speed (1.000, 0.500, ..., 0.008)]

Figure 5.10 - Asymptotic speedup when s = 3.4*10^-7 and g = 1.7*10^-6; computation speed is changing

5.14 Sensitivity to s, π and p

In estimating parameters it is important to know how sensitive the timing is to

small changes in the parameters. This is important because, before the program is run,

our timing expressions tell us how many processors, p*, to use. We then know the

maximum number of processors that we need for the problem.

p*, if only changed slightly, does not significantly affect the overall running

time. This suggests that we can round p* to the nearest whole number without

significantly affecting the predicted running time. More importantly, it might be

difficult to use the optimum p* when it is a large number. We would like to know

how much the timing would be affected if we use a few fewer processors than the

optimum.

Sensitivity of timing to changes in p at p*:

Assume that we round the optimal number of processors, p*, to its nearest

integer pint*, and we then run the problem with pint* processors rather than with p*. The amount of time the program runs with pint* is Tint; using p* processors would have given a running time of T.

    T = T(p*)
    Tint = T(pint*)

    T = γn/p + ρm + πmn/p + s + gm + (s+g)p

    p* = sqrt((γn + πmn)/(s+g))  ⇒  γn + πmn = p*²(s+g)            (1.1)

    T(p*) = (γn + πmn)/p* + (ρ+g)m + (s+g)p* + s

          = (γn + πmn)/sqrt((γn + πmn)/(s+g)) + (ρ+g)m + (s+g)·sqrt((γn + πmn)/(s+g)) + s

          = sqrt(γn + πmn)·sqrt(s+g) + (ρ+g)m + sqrt(s+g)·sqrt(γn + πmn) + s

          = 2·(γn + πmn)^(1/2)·(s+g)^(1/2) + (ρ+g)m + s

          = 2p*(s+g) + (ρ+g)m + s                                  from (1.1)

          ≥ 2p*(s+g)

    Tint = (γn + πmn)/pint + (ρ+g)m + (s+g)pint + s

    Tint - T(p*) = (γn + πmn)/pint + (s+g)pint - 2p*(s+g)

                 = p*²(s+g)/pint + (s+g)pint - 2p*(s+g)            from (1.1)

                 = (s+g)(p*²/pint + pint - 2p*)

                 = (s+g)(p*² + pint² - 2p*·pint)/pint

                 = (s+g)(p* - pint)²/pint                          a perfect square

                 < (s+g)(.5)²/pint                                 rounding p* to the nearest integer

                 < (1/4)(s+g)/pint

    (Tint - T(p*))/T(p*) < [(1/4)(s+g)/pint] / [2p*(s+g)] = 1/(8·pint·p*), which bounds the relative error.
A change in p* does not substantially affect T.

This can also be seen approximately from the second derivative in the Taylor series:

    dT/dp = -(γn + πmn)/p² + s + g

    T'' = 2(γn + πmn)/p³

    T'' = 2(s+g)/p                 at p*, using (1.1)

    Tint - T(p*) = T'(p*)(pint - p*) + (1/2)T''(pφ)(pint - p*)²

    |Tint - T(p*)| ≤ (1/2)(1/4)T''(pφ)                             since |pint - p*| < .5

                  ≤ (1/4)(γn + πmn)/pφ³ = (1/4)p*²(s+g)/pφ³ ≡ O(10^-4/p*)

The second derivative is extremely small, on the order of 10^-5, whereas an iteration on a 1,000 by 5,000 problem with 10 processors takes about .1 seconds. p* is at least 1 and is usually over 10 even for relatively small problems. Rounding p* to the nearest integer does not affect T in any significant way.

Sensitivity of timing to changes in s:

Assume we think that the startup time is serr; we then calculate perr* based on serr and run the problem with perr* processors. In fact the startup time is s (not serr). The amount of time the program runs with the wrong perr* is Terr. We should have used p* processors, which would have given a running time of T.



    T = T(p*(s), s)
    Terr = T(perr*(serr), s)

    T = 2p*(s+g) + (ρ+g)m + s                                      from (1.1)

    Terr = (γn + πmn)/perr + (ρ+g)m + (s+g)perr + s

    Terr - T = (γn + πmn)/perr + (s+g)perr - 2p*(s+g)

    perr = sqrt((γn + πmn)/(serr + g))  ⇒

    Terr - T = (γn + πmn)/sqrt((γn + πmn)/(serr + g))
               + (s+g)·sqrt((γn + πmn)/(serr + g)) - 2·sqrt((γn + πmn)(s+g))

             = sqrt(γn + πmn)·(sqrt(serr + g) + (s+g)/sqrt(serr + g) - 2·sqrt(s+g))

             = sqrt(γn + πmn)·((serr + g) + (s+g) - 2·sqrt(s+g)·sqrt(serr + g))/sqrt(serr + g)

             = (sqrt(γn + πmn)/sqrt(serr + g))·(sqrt(s+g) - sqrt(serr + g))²

    Terr - T = perr*·(sqrt(s+g) - sqrt(serr + g))²

Sensitivity of timing to changes in π:

Assume we think that a double floating-point multiplication takes time πerr; we then calculate perr* based on πerr and run the problem with perr* processors. In fact the floating-point multiplication time is π (not πerr). The amount of time the program runs with the wrong perr* is Terr. We should have used p* processors, which would give a running time of T.

    T = T(p*(π), π)
    Terr = T(perr*(πerr), π)

    p* = sqrt((γn + πmn)/(s+g))          ⇒  γn + πmn = p*²(s+g)          (1.1)

    perr* = sqrt((γn + πerr·mn)/(s+g))   ⇒  γn + πerr·mn = perr*²(s+g)   (1.2)

    p - perr = (sqrt(γn + πmn) - sqrt(γn + πerr·mn))/sqrt(s+g)  ⇒

    p²(s+g) - perr²(s+g) = πmn - πerr·mn  ⇒

    2(s+g)(p - perr)(p + perr) = 2mn(π - πerr)  ⇒

    2(s+g)(p - perr) = 2mn(π - πerr)/(p + perr)                          (1.3)

    T = (γn + πmn)/p + (ρ+g)m + (s+g)p + s

    Terr = (γn + πmn)/perr + (ρ+g)m + (s+g)perr + s

    T - Terr = γn(1/p - 1/perr) + πmn/p - πerr·mn/perr + (s+g)(p - perr)

             = (s+g)(p - perr + p - perr)                                using (1.1) and (1.2)

             = 2(p - perr)(s+g)

             = 2mn(π - πerr)/(p + perr)                                  from (1.3)

Taylor series for π:

    ∂p/∂π = (1/2)·((γn + πmn)/(s+g))^(-1/2)·(mn/(s+g))

          = (1/(2p))·(mn/(s+g))                                          (1.6)

    ∂T/∂π = [p·mn - (γn + πφ·mn)·∂p/∂π]/p² + (s+g)·∂p/∂π

          = [p·mn - p²(s+g)·mn/(2p(s+g))]/p² + (s+g)·mn/(2p(s+g))        from (1.6)

          = [p·mn - p·mn/2]/p² + mn/(2p)

          = mn/p                                                         (1.4)

    ∂²T/∂π² = -(mn/p²)·∂p/∂π = -(mn/p²)·(mn/(2p(s+g)))                   from (1.6)

            = -(mn)²/(2p³(s+g))

            = -(mn)²/(2p(γn + πmn))                                      from (1.1)      (1.5)

    T - Terr = (∂T/∂π)(πerr)·(π - πerr) + (1/2)·(∂²T/∂π²)(πφ)·(π - πerr)²

        where πφ is some point between π and πerr

             = (mn/perr)·(π - πerr) - (1/4)·((mn)²/(pφ(γn + πφ·mn)))·(π - πerr)²   from (1.4) and (1.5)

Taylor series for p when π changes:

    ∂²p/∂π² = -(1/4)·(1/p³)·(mn)²/(s+g)² < 0                             (1.7)

    p - perr = (∂p/∂π)(πerr)·(π - πerr) + (1/2)·(∂²p/∂π²)(πφ)·(π - πerr)²

    Since the second-order term is negative by (1.7),

    p - perr < (∂p/∂π)(πerr)·(π - πerr) = (1/(2perr))·(mn/(s+g))·(π - πerr)   from (1.6) and (1.7)

Graphs below show the sensitivity of both p* and time per iteration to changes in s and to changes in π.

Table 5.7 shows what happens as the error in startup time (s) increases. The

correct s value is in the middle of the table. It has a 0% error. Both p* and

time per iteration are shown for each error in the last two columns. Figure 5.11 and

Figure 5.12 graph time per iteration and p*, respectively, for the percentage errors in

s. The correct s value is in the center of the horizontal axis at 0% error. As you move

to the right the error assigns s too high a speed. As you move to the left the error

assigns s too low a speed. Note that the time per iteration increases in whichever

direction we err and irrespective of whether p* increases or decreases. This is because

we are no longer using the optimum p*.

Table 5.8 shows what happens as the error in pivot time (π) increases. Figures 5.13 and 5.14 correspond to Table 5.8. The analysis of the last paragraph for s applies correspondingly to errors in π.

A problem of 1,000 by 10,000 was used for these tables.

Note that a 30% change in π gives a 10% error in time per iteration and a 40%

change in s gives a 10% error in time per iteration.



s           % error    p*        T
2.694E-04    -40%      67.75     0.0343
2.501E-04    -30%      70.29     0.0341
2.309E-04    -20%      73.14     0.0340
2.116E-04    -10%      76.37     0.0339
1.924E-04      0%      80.07     0.0338
1.732E-04     10%      84.37     0.0339
1.539E-04     20%      89.44     0.0340
1.347E-04     30%      95.55     0.0343
1.154E-04     40%     103.11     0.0348
9.620E-05     50%     112.80     0.0356

Table 5.7 - p* and T as a function of relative error in s (the correct s, at 0% error, is in the middle row)

[Figure: time per iteration T versus % error in s, -40% to +50%]

Figure 5.11 – Time per iteration as a function of relative error in s

[Figure: p* versus % error in s, -40% to +50%]

Figure 5.12 - p* as a function of relative error in s



π           % error    p*        T
1.740E-07    -40%     189.52     0.0737
1.616E-07    -30%     182.62     0.0734
1.492E-07    -20%     175.46     0.0731
1.367E-07    -10%     167.99     0.0730
1.243E-07      0%     160.17     0.0729
1.119E-07     10%     151.95     0.0730
9.945E-08     20%     143.26     0.0733
8.702E-08     30%     134.01     0.0739
7.459E-08     40%     124.07     0.0750
6.216E-08     50%     113.26     0.0767

Table 5.8 - p* and T as a function of relative error in π (the correct π, at 0% error, is in the middle row)

[Figure: time per iteration T versus % error in π, -40% to +50%]

Figure 5.13 - Time per iteration as a function of relative error in π

[Figure: p* versus % error in π, -40% to +50%]

Figure 5.14 – p* as a function of relative error in π



6. Implementation Choices

In order to implement the distributed algorithm, a communication package

was necessary. A communication package is basically a library with functions for

performing parallel operations. The packages we considered can be used with most

programming languages. There are also a number of dedicated parallel programming languages.

6.1 Distributed programming software

In order to implement our parallel method we examined a number of

distributed programming packages. We focused mainly on PVM, MPI [Geist, 1996],

and BSP [Goudreau et al, 1999]. The package to be chosen had to be able to run on a

network of workstations, not just a Massive Parallel Processor (MPP). Another

concern was the ease of use and the portability. Geist [1996] points strongly to PVM.

He claims that MPI has a rich set of functions for point-to-point and collective

communication, but it does not run well on heterogeneous networks. PVM, on the

other hand, is built with the "virtual machine" concept in mind. It should work in a

heterogeneous environment and could handle dynamic process creation. On the other

hand, PVM has greater overhead and will under-perform MPI on an MPP and even

on a homogeneous network of workstations. If there are many small messages, the

overhead is multiplied.

Another package is BSP (Bulk–Synchronous Parallel). BSP is a model for which implementations have also been coded. Our application might work well with it because it can be synchronized at certain points, which is a central feature of BSP. We ruled out BSP, however, because it is not widely used.



6.2 Sockets

On a lower level of communication are socket calls. The distributed

programming software that was just mentioned in fact makes use of socket function

calls. There are two categories of sockets. One of the categories, TCP sockets, has

built-in error checking. It employs a three-way handshake and ensures that packets are

received in order. This is the category that is used by the distributed programming

software. TCP sockets cannot take advantage of the Ethernet’s broadcast facility. The

other socket category is known as UDP. This category is not used by the packages but does allow the Ethernet’s broadcast facility to be used. More information on socket

programming can be found in [Comer and Stevens, 1996].
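For illustration, enabling UDP broadcast takes a single socket option; a minimal sketch (error handling omitted):

#include <sys/socket.h>
#include <netinet/in.h>

int open_broadcast_socket(void) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);    /* a UDP (datagram) socket */
    int on = 1;
    /* Without SO_BROADCAST the kernel rejects broadcast destinations;
       TCP sockets have no equivalent, which is why they cannot use the
       Ethernet broadcast facility. */
    setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on));
    return sock;
}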

6.3 Reasons for choice of both sockets and MPI

Our application does not dynamically allocate processes and it does send

many small messages in the column selection process. We assume it will be run on a

homogeneous UNIX network. This suggests MPI over the other distributed parallel

packages. MPI is also one of the standard packages used and was available to us. If our network had processors with different speeds, PVM would be a little better, although with either package load balancing would have to be handled by our program.

MPI is a communication protocol. It specifies communication functions and

what they must do. One implementation is called LAM (Local Area Multi-computer),

which is an MPI environment that allows multiple workstations to work together in

solving one problem [Ohio supercomputer center, 1995]. LAM/MPI is the

implementation that we used; it is given as a library add-on to the programming



language. It includes communication functions that the processors on the network can

use to communicate. Two of the functions we use are Allreduce and Bcast. These

were explained in Section 5.2.3 in the context of our method’s communication needs; their implementation in LAM is described in Section 6.4.

Unfortunately, we were not able to find an implementation of MPI that makes

effective use of the broadcast capabilities of Ethernets. This feature of Ethernets is

essential to the performance of our distributed algorithm.

UDP sockets, on the other hand, allow us to use the Ethernet’s broadcast

capability. This makes a major difference in the scalability of our program. Figures

5.6 and 5.7 show the difference in performance. UDP sockets can safely be used on

local Ethernets where the hardware should deliver the packets in order. The error

checking that is left out in UDP is not necessary on a local Ethernet [Comer and

Stevens, 1996 pg. 13]. MPI is still useful on the larger networks where UDP sockets

cannot be used. We use sockets for empirical testing since they can take advantage of

the Ethernet broadcast, even though both the sockets and the MPI versions were

implemented.

For this reason we decided to directly use socket functions in place of the two

MPI commands. It is interesting to note that Bruck et al [1995] have implemented a communication package that does take advantage of broadcast.

6.4 The specific commands used

Even though we used sockets, the MPI terminology is still useful. The simplex

method consists, primarily, of one loop. Section 2.1 described the serial simplex

tableau method and Sections 5.1-5.2 gave a sketch of the steps for the parallel tableau

method. It was pointed out that within that one loop there are two communications.

First there is a computation period followed by a communication period that collects the maximum bid over each processor’s columns and broadcasts both the overall maximum and the identity of the “winning” processor to all the processors - an

MPI_Allreduce in MPI terminology. After that there is computation by one processor

(row choice). In the second communication, the winning processor broadcasts the

winning column to the rest of the processors; an MPI_Bcast in MPI terminology.

After that there is computation by all the processors.
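In MPI form, the two communications of one iteration would look roughly as follows (a sketch; the variable names are illustrative, not dpLP's actual identifiers):

#include <mpi.h>

void iteration_communications(double my_bid, int my_rank,
                              double *pivot_col, int m, MPI_Comm comm) {
    struct { double bid; int rank; } in = { my_bid, my_rank }, out;

    /* Communication 1: every processor proposes its best local column bid;
       MPI_MAXLOC returns the maximum bid and the winner's rank. */
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, comm);

    /* (The winning processor now performs the row choice locally.) */

    /* Communication 2: the winner broadcasts its pivot column to everyone. */
    MPI_Bcast(pivot_col, m, MPI_DOUBLE, out.rank, comm);
}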

6.5 A brief description of MINOS, a revised simplex implementation

MINOS is an implementation of the revised simplex method developed at

Stanford University [Murtagh and Saunders, 1983-1998]. It takes as input linear and

nonlinear programs in the MPS format. Our experiments, detailed in Section 7, relied

on comparing our method with the revised method; we used MINOS as our

representative of the revised method. It is important to note that we are comparing our

algorithm with the revised method in general. It is difficult to directly compare it with

MINOS when implementations of the revised method vary widely based on heuristic

differences. This is explained in more detail in Section 7.2.3. MINOS is often used

for research, one reason being that its source code is available. In our experiments we used

version 5.5.

We will detail a few of the MINOS settings we used to help us compare

MINOS to our program. More comprehensive details of how to use MINOS can be

found in the MINOS user's guide [Murtagh and Saunders, 1998]. MINOS takes two

files as input: An MPS data file and a specification file. The specification file tells

MINOS the features and parameter values to use when solving the problem. MINOS

defaults to using a crash procedure to get an initial basis [MINOS User’s Guide

Chapter 3]. Our code does not. In order to make comparisons more direct we disabled

it in MINOS, using "CRASH OPTION 0" in the spec file. MINOS will now simply

choose all the slack variables as the basis. We also set "SCALE OPTION 0" so the

problem would not be scaled. This is important because our program and MINOS

have different scaling methods. Below is the MINOS specification file that we used.

BEGIN general
PARTIAL PRICE 1
SCALE OPTION 0
CRASH OPTION 0
MPS FILE 10
PRINT LEVEL 1
PRINT FREQUENCY 1
SUMMARY FREQUENCY 1
END general

MINOS uses partial pricing. As noted in Section 3.2, the revised method benefits from a large n to m ratio. Another way the revised method can reduce computation is by avoiding the pricing out of every column. Instead of getting the best value from every column one can simply choose from a subset of the columns. This is called partial pricing as opposed to full pricing. By default MINOS will use partial pricing when n is large: if n is at least 1000 or if n is at least 4 times m, partial pricing will be used [Murtagh

and Saunders, 1998 Ch. 3]. In order to make comparison more direct we disabled

partial pricing with the line "PARTIAL PRICE 1" so that all columns are searched to

find the entering column.



The revised method makes extensive use of reinversions. There are three

reasons for reinversions: a) numerical stability b) to support some degeneracy

procedures [Gill et al, 1987] and c) refactorization of matrices used in the revised

method [Chvátal, 1983]. A full tableau method would only do a reinversion for the first two reasons. Reinversions for the first two reasons are executed infrequently, whereas refactorizations are quite frequent. Our serial algorithm has a “refresh” procedure built in; it was usually disabled for the purposes of experimentation (but see Figure 7.9).



7. Computational Experiments

Test runs were performed on a number of synthetic problems with matrices of

varying sizes, aspect ratios and densities using our serial method, our parallel method

and MINOS. Section 7.1.1 discusses these synthetic problems. Tests were also

performed on problems in the Netlib library (see Section 7.1.2). These tests were also

used to validate the models and to compare our standard method with the revised

method.

7.1 Problems used for experimentation

For use in experimentation, we needed problems of specific sizes and

densities. At the same time we wanted to use more realistic problems. These objectives were accomplished by writing our own synthetic linear program generators and by also using the Netlib library [http://www.netlib.org/lp].

7.1.1 Synthetic linear programs

We developed three LP generators: generator, generator1, and generator2.

All of them take as input m = the number of rows, n = the number of columns, s = the density of the non-zero coefficients (0 < s ≤ 1), and seed = the seed for the random number generator; in addition the user specifies a file descriptor for the MPS output, and a problem name.

generator

All the constraints are of the less-than-or-equal type. Each constraint coefficient is, independently, positive with probability s and zero otherwise. The value of a non-zero coefficient is chosen uniformly between 0 and 1. The right hand side

coefficients are all 1. The objective coefficients are all -1, with the exception of those

corresponding to columns that, by chance, end up with all 0 coefficients. In this case

the corresponding objective coefficient is set to +1. This prevents unbounded

solutions. Thus excluding these zero columns, we seek to maximize the sum of the

variables. The initial solution determined by all the variables = 0 is feasible so that no

phase I procedure is necessary. There is no guarantee (except the law of large

numbers) that the actual density of the problem is exactly or even close to s. The

program does report the actual density. The output is an mps format file defining the

generated LP.
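The core of generator can be sketched as follows (our C99 illustration of the logic just described; the actual program additionally writes the MPS file and reports the realized density):

#include <stdlib.h>

static double urand(void) { return rand() / (RAND_MAX + 1.0); }

/* Sketch of generator: m rows, n columns, target density s, seed. */
void generate(int m, int n, double s, unsigned seed,
              double a[m][n], double b[m], double c[n])
{
    srand(seed);
    for (int j = 0; j < n; j++) {
        int nonzeros = 0;
        for (int i = 0; i < m; i++) {
            /* nonzero with probability s, value uniform on (0,1) */
            a[i][j] = (urand() < s) ? urand() : 0.0;
            if (a[i][j] != 0.0) nonzeros++;
        }
        /* objective coefficient -1; +1 for all-zero columns,
         * which prevents unbounded solutions */
        c[j] = (nonzeros > 0) ? -1.0 : 1.0;
    }
    for (int i = 0; i < m; i++)
        b[i] = 1.0;   /* every right hand side coefficient is 1 */
}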

A major problem with generated problem instances (synthetic problems) is

that there may be covert, underlying structure that makes the problem much more

special than problems that might appear in practice. This was recognized early on by

Kuhn and Quandt [1963]. They proved a theorem that applies to generator, which

gives an asymptotic estimate of the value of the objective. Luby and Nisan [1993]

also gave a fast parallel, approximate algorithm that applies to non-negative

problems; this also applies to the problems generated by generator. Thus there are

obviously special features of this class of problems, which make them easier.

Fortunately, this is revealed more by the number of iterations than the work per

iteration. Since our methods apply to savings within iterations, these considerations

should not affect our results.

generator1

The constraints are generated as in generator, and they are also all less than or

equal constraints. The objective coefficients are now generated randomly between -1

and -0.5. If the column has all zero coefficients in the constraints the sign is reversed.

The right hand side coefficients are also generated randomly, uniformly in the range

0.5 to 1. The Kuhn-Quandt Theorem no longer applies, but the Luby-Nisan Algorithm

does.

generator2

In this generator we again have less than or equal constraints. The non-zero

matrix elements are generated uniformly between -1 and 1. The objective coefficients

are generated randomly between -1 and 1. The variables are constrained to be

between -m and m. The constraint values are required to lie between -1 and 1. Again, setting all variables to 0 is feasible - no Phase 1 is required. Neither the Kuhn-Quandt Theorem nor the Luby-Nisan Algorithm applies to the results of this generator. Notice

that this (and only this) generator requires the RANGE feature of the solver. The

RANGE feature provides for upper and lower bounds of the constraints as well as the

variables.

Figure 7.1 shows the total time as density increases for the three generators.

Figure 7.2 shows the time per iteration as density increases for the three generators.

Note that the total running times vary widely across the three generators while the times per iteration are very close. This supports our view that the type of synthetic problem used affects total running time more than the time per iteration.

Figure 7.1 – Total time by generator vs. density



Figure 7.2 – Time per iteration by generator vs. density



7.1.2 Netlib Problems

Netlib contains a repository of difficult linear programming problems

[www.netlib.org/lp/data, 1996]. These problems are often used as benchmarks for

testing linear programming code. The Netlib problems are in general sparse. Table

7.1 contains a listing of the Netlib problems sorted by their density.



Table 7.1: Netlib problems sorted by density


Name Rows Cols Nonzeros density

FIT1D 25 1026 14430 56.26%
FIT2D 26 10500 138018 50.56%
KB2 44 41 291 16.13%
WOOD1P 245 2594 70216 11.05%
AFIRO 28 32 88 9.82%
SHARE2B 97 79 730 9.53%
ISRAEL 175 142 2358 9.49%
ADLITTLE 57 97 465 8.41%
BLEND 75 83 521 8.37%
BEACONFD 174 262 3476 7.62%
FORPLAN 162 421 4916 7.21%
GROW7 141 301 2633 6.20%
BOEING2 167 143 1339 5.61%
SC50A 51 48 131 5.35%
SCSD1 78 760 3148 5.31%
SC50B 51 48 119 4.86%
RECIPE 92 180 752 4.54%
SHARE1B 118 225 1182 4.45%
E226 224 282 2767 4.38%
BRANDY 221 249 2150 3.91%
STOCFOR1 118 111 474 3.62%
AGG 489 163 2541 3.19%
SCAGR7 130 140 553 3.04%
GROW15 301 645 5665 2.92%
AGG3 517 302 4531 2.90%
AGG2 517 302 4515 2.89%
BOEING1 351 384 3865 2.87%
SCSD6 148 1350 5666 2.84%
SC105 106 103 281 2.57%
STAIR 357 467 3857 2.31%
TUFF 334 587 4523 2.31%
LOTFI 154 308 1086 2.29%
VTP.BASE 199 203 914 2.26%
BORE3D 234 315 1525 2.07%
GROW22 441 946 8318 1.99%
DEGEN2 445 534 4449 1.87%
CAPRI 272 353 1786 1.86%
BANDM 306 472 2659 1.84%
SCFXM1 331 457 2612 1.73%
D6CUBE 416 6184 43888 1.71%
SCTAP1 301 480 2052 1.42%
FFFFF800 525 854 6235 1.39%
SC205 206 203 552 1.32%
PILOT4 411 1000 5145 1.25%
SCORPION 389 358 1708 1.23%
SCSD8 398 2750 11334 1.04%
FIT1P 628 1677 10894 1.03%
SHIP04L 403 2118 8450 0.99%
SHIP04S 403 1458 5810 0.99%
DEGEN3 1504 1818 26230 0.96%
SEBA 516 1028 4874 0.92%
ETAMACRO 401 688 2489 0.90%
FINNIS 498 614 2714 0.89%
SCFXM2 661 914 5229 0.87%
25FV47 822 1571 11127 0.86%
SCAGR25 472 500 2029 0.86%
PILOT 1442 3652 43220 0.82%
MAROS 847 1443 10006 0.82%
BNL1 644 1175 6129 0.81%
PILOT.JA 941 1988 14706 0.79%
STANDATA 360 1075 3038 0.79%
PILOT87 2031 4883 73804 0.74%
STANDGUB 362 1184 3147 0.73%
STANDMPS 468 1075 3686 0.73%
NESM 663 2923 13988 0.72%
SCRS8 491 1169 4029 0.70%
PEROLD 626 1376 6026 0.70%
PILOTNOV 976 2172 13129 0.62%
SCFXM3 991 1371 7846 0.58%
QAP8 913 1632 8304 0.56%
GFRD-PNC 617 1092 3467 0.51%
SHELL 537 1775 4900 0.51%
SHIP08L 779 4283 17085 0.51%
MAROS-R7 3137 9408 151120 0.51%
SHIP08S 779 2387 9501 0.51%
PILOT.WE 723 2789 9218 0.46%
CZPROB 930 3523 14173 0.43%
TRUSS 1001 8806 36642 0.42%
WOODW 1099 8405 37478 0.41%
SCTAP2 1091 1880 8124 0.40%
CYCLE 1904 2857 21322 0.39%
MODSZK1 688 1620 4158 0.37%
SIERRA 1228 2036 9252 0.37%
SHIP12L 1152 5427 21597 0.35%
SHIP12S 1152 2763 10941 0.34%
GANGES 1310 1681 7021 0.32%
D2Q06C 2172 5167 35674 0.32%
SCTAP3 1481 2480 10734 0.29%
GREENBEA 2393 5405 31499 0.24%
GREENBEB 2393 5405 31499 0.24%
STOCFOR2 2158 2031 9492 0.22%
BNL2 2325 3489 16124 0.20%
QAP12 3193 8856 44244 0.16%
FIT2P 3001 13525 60784 0.15%
80BAU3B 2263 9799 29063 0.13%
QAP15 6331 22275 110700 0.08%
DFL001 6072 12230 41873 0.06%
STOCFOR3 16676 15695 74004 0.03%




7.2 Validation of Performance Models

7.2.1 Computation verification

Computation in dpLP (Dantzig rule)

Section 5 provided running time projections for our serial and parallel

programs. We used our models to pick the optimal number of processors to use. We

then were able to compare the running times of both our parallel and serial algorithms

with the revised simplex algorithm and to characterize which types of problems our

methods are good for. This analysis shows the advantages of our dpLP parallel

program for all problem sizes. This was discussed in Section 5. The parallel

program’s expression had a computation part and a communication part. We also had

a separate computation expression for the steepest edge column choice rule. In this

section we validate those expressions by comparing the actual running times for

linear programs with our projections.

In order to verify the estimations given in Section 5, we first have to estimate

the coefficients of the terms in our expressions. These coefficients would vary with

the computing environment (machines and network). In Section 5 we gave three

expressions that can be verified in our environment. Two are for computation, one for

the standard column choice rule and one for the steepest edge rule. One is for

communication, which doesn’t depend on the column choice rule.

Similarly, for our serial full tableau program we have two expressions and for

the revised (MINOS) program we have one expression.

The coefficients required for the computation expressions are a) column

choice time per unit vector element (ucol), b) row choice time per unit vector element

(urow), and c) pivot time per unit matrix element (upiv). These constants are defined

in Section 5.5. The constant terms required for the communication expressions are s

and g. An explanation of these constants is also given in Section 5.5.

There are, in general, two methods that we employed to get the coefficients.

One is by directly estimating those coefficients. The second method applies linear

regression to actual runs of the linear programming code to estimate the values of the

coefficients. If the regression produces a tight fit we can be confident that the

coefficients are accurate and that the expression correctly estimates the running times

of the programs.

To directly estimate a coefficient we time the specific code segment

corresponding to that coefficient. We then divide the time by the variables that are

multiplied by it in the expression. For example, in order to find upiv we time the

pivot function call. We then divide it by mn since mn is multiplied by upiv in the

expression.
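A minimal sketch of this direct estimation follows; unit_cost is our illustrative helper, and for upiv the work function would wrap one call of the pivot routine with term equal to m*n:

#include <sys/time.h>

/* Time reps calls of a code segment, then divide out the model
 * term that multiplies its unit cost (m*n in the case of upiv). */
double unit_cost(void (*work)(void), int reps, double term)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int r = 0; r < reps; r++)
        work();                              /* e.g. one pivot */
    gettimeofday(&t1, NULL);
    double secs = (t1.tv_sec - t0.tv_sec)
                + (t1.tv_usec - t0.tv_usec) / 1e6;
    return secs / reps / term;               /* seconds per unit */
}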

To verify the expressions, we generated a series of dense problems. We used

these problems together with problems from the Netlib library. In order to verify the

parallel dpLP expressions, the problems were run in parallel using multiple processors.

The smallest number of processors used was 2 and the largest was 7.

Figure 7.3 plots time per iteration against mn for the standard column choice

rule. Figure 7.4 is a similar graph for the steepest edge rule and is explained later in

this section. In Figure 7.3 the coefficient upiv dominates, especially for large

problems. This is because the pivot step in fact takes about 95% of the computation

time. The other two coefficients can actually be left out of the expression. One can

see from the figure that as mn grows so does the running time. The points of the

actual running times almost completely lie on the projected value line. This verifies

that the running time is almost completely dependent on mn. Since mn is the pivot

term of the expression, it justifies our leaving out the other terms. In fact, regression

was only used to find the value of upiv. The other coefficients were not accurately estimated by regression, probably because upiv dominates the other coefficients.

Below are the values obtained using the direct timing of the 3 steps of an

iteration. These values come from Table 5.1 and 5.2 in Section 5.4. We also include

upiv as estimated using regression even though its value was not used in the formulas.

ucol_se is the unit cost for the steepest edge column choice rule. This value is

only used in the steepest edge verification further in this section.

ucol                3.73E-08
urow                1.65E-06
upiv (direct)       1.24E-07
upiv (regression)   1.27E-07
ucol_se             3.47E-07

Only the pivot coefficient and term of the expression are used to estimate the

timing. The other terms are negligible. The estimation was applied to a number of

problems in the Netlib library in addition to generated problems. The relative

percentage error of our estimate relative to the observed timing was calculated as 100 × (observation − estimate) / estimate.

The mean percentage relative error observed amongst these problems was

5.00%. It is important to note that most large problems had a relative percentage error

of less than one percent. Unfortunately a few of the smaller problems gave larger

errors, which pulled the average up. The maximum relative error was 19.25%. As we

explained, the pivot step takes the main bulk of the time; the time taken by the other steps is relatively insignificant. For small problems that assumption is weaker, which can cause a larger relative error.



Figure 7.3 – Iteration time vs. mn (classical column choice rule)



Figure 7.4 – Iteration time vs. mn + αmn (steepest edge column choice rule)

Serial program (retroLP) verification

The serial time expression is essentially the same as our parallel expression.

The only difference is that it uses only one processor. We used the same coefficients

obtained for the parallel program’s expression for the serial expression. We executed

the serial program for the same group of problems described above. We then took the

average error between the estimation and its actual running time. Our serial program

gave 15.34% and 7.73% for the maximum and average relative percentage errors

respectively. Again the small problems pulled up the mean relative percentage error.

Steepest Edge verification

The expression for the steepest edge rule is different from the computation

expression only in the column choice part. Instead of having to do m comparisons we

now must do at most mn multiplications; m multiplications for each of the eligible

columns (see Section 3.3). This could roughly double the number of significant

operations compared to using the standard column choice rule. Based on Table 5.2

this will actually cost, on our network, between three and four times the total

computation time per iteration compared to using the standard column choice rule.

This assumes that all the columns are eligible. In practice, however, many columns

are not eligible. In one empirical test we found that only 35% of the columns were

eligible; the other columns could be immediately eliminated. The cost of an iteration

is therefore upper bounded by twice the number of operations and between three and

four times the time cost of an iteration (on our network) when the standard column

choice rule is used. This upper bound is rarely reached. For the steepest edge column

choice rule, therefore, the ucol_se coefficient is also significant. Note that this coefficient is different from the ucol coefficient in the standard column choice rule discussion. ucol_se here represents the unit cost of doing the steepest edge column choice rule assuming all columns are looked at. The value of ucol_se was listed earlier in this section.

In order to accurately estimate the running time of the program when using steepest edge we must know the percentage of columns actually looked at during the run. This percentage would then be multiplied by ucol_se. It is not known before a problem is solved, and we therefore cannot accurately estimate the running time in advance. The expression can nevertheless be used to show the accuracy of the model, as in Figure 7.3. Furthermore, if we use the generic 35% figure mentioned above we do get a reasonably good estimate of the running time. Assuming we know the number of columns actually looked at during execution, 17.72% and 8.22% are the maximum and average relative percentage errors respectively.
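In other words, writing α for the fraction of columns actually examined, the per-iteration estimate being validated here has the form (our notation, restating the mn + αmn of Figure 7.4):

    T_iter ≈ upiv·mn + α·ucol_se·mn,

with the remaining terms negligible as before.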

A graph similar to the one shown for the standard column choice rule is

provided in Figure 7.4. The horizontal axis, as in Figure 7.3, is the problem size mn.

Figure 7.4 shows two lines. The top line corresponds to problems where the program

looks at every column within the steepest edge column choice rule. The bottom line

corresponds to a problem where the program looks at none of the columns within the

steepest edge column choice rule. In practice a percentage of the columns are looked

at as we just explained. Note that the actual running times per iteration fall in between

these projected lines. This verifies our steepest edge rule timing projection.

Graphs of total running times are provided in Section 7.3.

7.2.2 Communication verification (using regression for coefficients)

Communication time

We now discuss the following issues:



i) User time vs wall clock time

ii) Regression vs direct timing

iii) Network idle time

iv) Socket barrier commands vs. MPI_Barrier to separate wait_time and

communication time

v) Separation of communication time from wait time

vi) Timing with both 2 processors and with up to 7 processors.

As in computation, there are two general ways of estimating the

communication coefficients s and g. One is to use s and g calculated by

experimentation using measurement programs. The other is to use regression on

timings of the actual LP programs.

The timing for communication is the wall clock time. It is important to run

communication tests during network idle time to avoid time accruing from other

processes running.

Another issue is to make sure that the communication times are accurate for

more than two communicating processors. To this end we estimated s and g in the

context of broadcast and all reduce. This verified the accuracy of s and g even when

more than two processors were involved in the communication.

Communication time can be divided into two parts. First, before the first

message can be read, the reading processor might be waiting for the sender to finish

its computation. This is referred to as wait time. The second part is the actual

communication time. We initially divided the two by putting an MPI_Barrier

command before the timing of the communication within the program. The only need

for an explicit barrier command is for this particular timing test. This command

separates the effect of processors waiting for other processors from the

communication itself. The comparison would then be on the communication time

without the effects of wait time.

The MPI_Barrier command itself has significant overhead. It actually adds 7

ms to the wait time. We timed this by putting a number of barrier commands inside a

loop.

Instead of MPI’s barrier command, a sequence of read and send commands in

the socket interface was substituted. This surprisingly decreased not only the wait

time but also the communication time.

It is unclear why there was a decrease in the communication time. It might be because the MPI barrier releases the processors at slightly different times after they enter it. The new socket barrier method seems to take away most of the

overhead the MPI barrier had. For the problems tested, the wait time plus the

communication time were almost the same as the communication time that was

obtained when there was no barrier statement.
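One simple way such a barrier can be assembled from plain socket calls is a coordinator that collects one byte from every peer and then answers each of them. The sketch below is our illustration of the idea, not the exact sequence used in dpLP:

#include <unistd.h>
#include <sys/socket.h>

/* Centralized barrier over connected sockets.  On the coordinator,
 * fds[0..npeers-1] are the peer connections; on a peer, fds[0] is
 * the connection to the coordinator. */
void socket_barrier(int is_coordinator, int *fds, int npeers)
{
    char tok = 'B';
    if (is_coordinator) {
        for (int i = 0; i < npeers; i++)
            read(fds[i], &tok, 1);        /* collect arrivals  */
        for (int i = 0; i < npeers; i++)
            send(fds[i], &tok, 1, 0);     /* release everybody */
    } else {
        send(fds[0], &tok, 1, 0);         /* announce arrival  */
        read(fds[0], &tok, 1);            /* wait for release  */
    }
}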

Table 7.2 compares percentage errors in two groups of problems. The first is a

group of 24 problems each of which was executed using 2 processors. The wait time

was separated from the communication time by use of socket calls. The second group

used the same problems as the first group. This group, though, was executed once

using 2 processors and then with 3 processors… all the way up to 7 processors. This

gives a total of 144 runs. The table contains both the maximum and average relative

percentage error for these two problem groups. The rows correspond to s and g values

derived from different sources. The first row shows the s and g that result from direct

experimentation. The bottom two rows show the s and g that result from regression.

Table 7.3 lists the 24 problems that were used with their sizes and densities. The first

10 problems, with names beginning with “d” are synthetic problems generated by

generator (the first one). Note however, that for this experiment the densities are

irrelevant since we are not using the revised method.

Only communication time is compared (no computation or wait time); the socket barrier method was used (no MPI_Barrier).

                                    s          g          data1 (2 p)         data2 (2 to 7 p)
                                                          max %     avg %     max %     avg %
from experiments (s1, g1)           1.36E-04   1.61E-07   273.41%   146.97%   170.65%   77.47%
regression on 2 p (s3, g3)          1.92E-04   1.50E-06   31.66%    12.67%    33.28%    12.96%
regression on 2-7 p (s4, g4)        2.00E-04   9.11E-07   64.45%    24.94%    28.75%    4.70%

Table 7.2 - Percentage errors in both groups of problems



name m n density
d500x5000 500 5000 100%
d100x4000 100 4000 100%
d200x2000 200 2000 100%
d100x3000 100 3000 100%
d100x2000 100 2000 100%
d100x1000 100 1000 100%
d100x1000 100 1000 100%
d100x500 100 500 100%
d100x100 100 100 100%
d10x100 10 100 100%
share2b 96 79 9.53%
share1b 117 225 4.45%
stair 356 467 2.31%
ship04l 402 2118 0.99%
ship04s 402 1458 0.99%
standata 359 1075 0.79%
standmps 467 1075 0.73%
standgub 361 1184 0.73%
shell 536 1775 0.51%
ship08l 778 4283 0.51%
ship08s 778 2387 0.51%
woodw 1098 8405 0.41%
ship12l 1151 5427 0.35%
ship12s 1151 2763 0.34%

Table 7.3 - The 24 problems used

These tests were repeated on the large problems. (Four of the 24 problems

were excluded.) In this set of 20 problems one had 50,000 matrix elements, and the

other 19 had at least 100,000 elements. The results were virtually the same (within 0.5 of a percentage point) as when we had all 24 problems.

Wait time

Wait time is the time that processors spend at a synchronization point waiting

for other processors to finish computation. In general this time should be short if the

load is evenly distributed amongst the processors. This waiting time is actually a

function of the computation parameters m, n, and p. The longer the computation, the more two different processors may vary in their computation time. For the classical column choice rule, the large cost of pivoting is what causes most of the wait time; the column choice itself takes an insignificant amount of time. In the steepest edge

rule, both the cost of pivoting and column choice heavily contribute. Only one

processor does the row choice, which is why it does not contribute to wait time but is

instead considered computation time.

We timed many pivots on a constant size dense matrix. We found very small

random discrepancies in time between the pivots. Each pivot step does the same

number of calculations. Since the discrepancies were very small and random, we assume they come from something random within the computer. For each iteration the

processors must wait for the slowest pivot. This wait time of one iteration should be

equal to the maximum pivot time of the processors minus the minimum pivot time of

the processors. Sum this per-iteration wait time over all iterations. This sum should be

the total wait time.
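In symbols (our notation), with T_q(i) denoting the pivot time of processor q in iteration i and I the total number of iterations:

    total wait time ≈ Σ_{i=1..I} [ max_q T_q(i) − min_q T_q(i) ].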

In our small problems, the computation time is much larger than the

communication time. As a result, the wait time is greater than the communication

time. This should change as the optimal number of processors is reached. At that

point, communication will be approximately as costly as computation. The

communication time will then be much larger than the wait time.

7.2.3 Analysis of the revised program (MINOS) expression

The revised method has several variants. They all go through the same basic

steps that use the inverse of the basis or some functionally equivalent representation.

For a more detailed discussion see [Nash, 1996] and [Chvátal, 1983]. The basic steps

are as follows:

A. Pricing out the c (objective) vector.

B. Choosing an entering basic variable.



C. Constructing the entering column.

D. Choosing the pivot row.

E. Updating the inverse or its functional equivalent.

Steps A and C make use of the “basis inverse” while step E keeps the “basis

inverse” current. Step E is executed every iteration in the case of the explicit inverse. If the inverse is stored as a factorization made up of other simple matrices, it is executed only once every several iterations. These “simple” matrices correspond to

some triangular matrix decomposition of the basis matrix such as LU decomposition.

In the latter case, step E is known as a refactorization and can cost up to m³ depending on sparsity, as explained below. When an explicit inverse is maintained, step E has a cost of about m², where m is the rank of the basis.

A procedure similar to a refactorization, which we call refresh is sometimes

performed even in the explicit inverse for the sake of numerical accuracy. Refresh is

much more infrequent and is not discussed here.

A very big factor in the running time of the revised method is sparsity. There

are two types of sparsity. The first is the sparsity of the original data. The second is

the sparsity of the inverse or its equivalent. Fill-in is the term used when the “inverse”

representation starts accumulating nonzeros.

Steps A and C can always take advantage of the first type of sparsity. The

explicit inverse representation of the revised can only be expected to take advantage

of the first type of sparsity. This is because there is only one inverse and in general

the inverse of a matrix will be dense even for a sparse matrix. On the other hand,

there are many possible factorizations of a matrix. This allows a “smart” factorizing

construction to choose one with very little fill-in. This is implemented by heuristically

choosing pivot elements that result in a sparse factorization. This allows inverse

factorization methods to take advantage of both forms of sparsity.

The Markowitz ordering scheme used in MINOS is an example of this. Steps

A, C and E, in these schemes, can take advantage of the second type of sparsity. Eta

factorization and triangular factorization are two ways of factorizing the inverse.

MINOS uses triangular factorization. It adds Eta vectors for each pivot until the next

refactorization.

The MINOS User’s guide says [Murtagh and Saunders, 1998]: “MINOS

maintains a sparse LU factorization of basis matrix B using a Markowitz ordering

scheme and Bartels-Golub updates, as implemented in the LUSOL package of Gill et

al [1987]. For a description of the concepts involved see Reid [1976, 1982]. The basis

factorization is central to the efficient handling of sparse linear and nonlinear

constraints.” MINOS therefore takes advantage of both types of sparsity.

MINOS uses many heuristics to speed up computation. One heuristic is the

“LU density tolerance.” It changes the refactorization based on density. MINOS 5.5

defaults to refactorizing every 100 iterations. This can be changed. It is difficult to

come up with a performance model for MINOS that takes all the heuristics into

account.

We can make an expression for the revised method that would take the first

type of sparsity into account. In Section 5 the graphs and discussion assumed an

expression that uses the explicit inverse form of the revised method. This is not the

way MINOS implements the revised but it’s close to the upper bound when the

second kind of sparsity is assumed not to occur. The sparsity in the functional

equivalent of the basis inverse is unknown before solving the problem. The revised

expression can theoretically be verified by going into the MINOS source code and

calculating the fill-in that occurs every iteration, similar to what we did for the steepest edge expression in our full tableau program. Since MINOS is not our code, we did not do that.

We can, though, show a comparison of dpLP to MINOS as the problem

density rises and as the number of processors rises. This is shown in the next section.

7.2.4 Revised vs. retroLP and dpLP

Figure 7.5 corresponds to Figure 5.3 and Figure 7.6 corresponds to Figure 5.6

and 5.7. Note that Figure 5.7 uses more processors than we have. Tables 7.4 and 7.5

correspond to Figures 7.5 and 7.6 respectively.

We ran a problem of size 1,000 by 5,000. For Figure 7.5 and Table 7.4 we

used 5% density. We stopped the program after 500 iterations. For those few that did not run for that many iterations, we scaled the time taken by 500/(the number of pivots performed). This gives

the time it would take for 500 iterations. Only 3 problems needed this.

Figure 7.5 and Table 7.4 compare MINOS and retroLP over varying densities.

We can see that for this problem, at somewhere between 70% and 80% density,

retroLP overtakes MINOS in speed.

Figure 7.6 and Table 7.5 compare MINOS and dpLP on a problem with 5%

density. The number of processors is increased up to 7. The optimum value is in fact about 53 processors. MINOS takes 24.24 seconds, whereas dpLP's time on 7 processors comes down to 45.64 seconds. The computational model given in Section 5



predicts a time of 41.72 seconds for dpLP on 7 processors, a prediction within 10% of the observed time. The same model predicts a running time of 11.73 seconds if dpLP were run on the optimal number of processors. From the graph, we can also see the steady speedup as the number of processors increases; the speedup had not leveled out at 7

processors.

Figure 7.5 – Actual timing as a function of density

Figure 7.6 – Actual timing as a function of p



Density Revised-MINOS (secs.) Serial-retroLP (secs.)
5% 24.240 306.640
10% 43.630 306.640
20% 82.314 306.640
40% 148.213 306.640
50% 194.070 306.640
60% 242.720 306.640
70% 285.060 306.640
80% 323.440 306.640
90% 352.384 306.640
100% 397.718 306.640

Table 7.4 - Actual timing as a function of density

Processors (for dpLP) Revised-MINOS-5% (secs.) Parallel-dpLP (secs.)
1 24.240 306.640
2 24.240 155.747
3 24.240 108.617
4 24.240 77.479
5 24.240 65.574
6 24.240 53.287
7 24.240 45.638

Table 7.5 - Actual timing as a function of p



7.3 Total Time Comparisons

As noted in Sections 4.1.2 and 5.11, one of the advantages of using a full

tableau parallel algorithm is the ability to take advantage of more complicated column

choice rules. Figure 7.7, “retroLP vs. MINOS”, shows total running time as a function

of density for problems with m=500 and n=1,000. It shows retroLP with both the

Dantzig and steepest edge column choice rules. It also shows MINOS (revised

method). Figure 7.8, “Speedup Relative to MINOS (m=500, n=1,000)”, shows

MINOS time divided by retroLP time as a function of density for the same data. The

density at which the Dantzig column choice rule overtakes MINOS is around 70%. The density at which the steepest edge column choice rule overtakes MINOS is

between 2% and 5%. The points in both of these figures represent nine runs each, one

run for each of the three generators combined with three different seeds.

Figure 7.9, “Speedup Relative to MINOS (m=1,000, n=5,000)”, is similar to

Figure 7.8. It shows MINOS time divided by retroLP time as a function of density for

problems with m=1,000 and n=5,000. These runs executed a tableau reinversion once

every 5,000 iterations. This reinversion cost is very close to 20% extra time for the

Dantzig column choice rule and about 15% extra time for the steepest edge column

choice rule. This is why the Dantzig version doesn’t catch up with the revised in this

figure.

It should be noted that although partial pricing doesn’t help in retroLP for the

classical column choice rule, it would make a big difference in the steepest edge rule.

The timing in this section was done on the PC environment mentioned in

Section 5.3. The UNIX environment was used for all the other timing.

Figure 7.7 – retroLP vs. MINOS



Figure 7.8 – Speedup relative to MINOS (m=500, n=1,000)



Figure 7.9 – Speedup relative to MINOS (m=1,000, n=5,000)



8. Summary, Applications and Future Work

8.1 Summary

In conclusion, our method has made large linear programs more tractable. It is especially good for large dense problems, or when the optimal number of processors is available (even for problems that are not dense). By taking advantage of parallelism it can also divide up the extra load of alternate column choice rules, which reduce the number of pivot steps required to solve the problem.

We have

1) Implemented a good general-purpose simplex program, called retroLP, using

the full tableau method. This implementation runs both on UNIX machines

and on PCs

2) Extended it to work on distributed systems using both MPI and IP

programming. This is called dpLP.

3) Developed performance models for both computation and communication to

optimize the number of processors. Although our network allowed verification

for only an Ethernet broadcast model, we gave expressions for several

different communication models.

4) Determined at what density our method becomes more efficient than existing

revised simplex implementations

5) Analyzed the number of processors needed to make our parallel method more

efficient than existing revised simplex implementations even for sparse

problems

6) Analyzed when the other column choice rules, in particular the steepest edge

column choice rule, can help our parallel method achieve faster total running

times than existing revised simplex implementations. This efficiency is

achieved at lower densities and with fewer processors than when using the

classical column choice rule.

8.2 Applications with dense matrices

There are a number of applications that lead to dense linear programs. One is

data mining [Bradley, 1999] and text categorization using the method of [Bennett and

Mangasarian, 1992], [Bosch and Smith, 1998]. The idea is, given a collection of

different articles and a group of categories, to put each article into its proper category.

We can take a document and for a given category decide whether or not the document

is a member of that category. First, n keywords are chosen to help distinguish between

categories. The variables in the LP correspond to these words. For each article each

keyword is counted to get its frequency in that article. The vector of these frequencies

defines a point in n space, which corresponds to a row in the LP. For most groups of

words the resulting tableau will be sparse. If instead of using individual words we

aggregate groups of words, the problem will become smaller and denser. The

grouping is known as feature compression and extraction. One solves the resulting

dense linear program to find a hyperplane that separates the points in the category

from the points out of the category.

Digital Filter Design [Steiglitz, 1992] and De-noising of images [Mallat,

1999 pg. 419] give rise to other dense applications. LP Relaxations of Machine

Scheduling Problems [Uma, 2000] is another dense application. The idea is to



schedule a number of tasks in such a way as to minimize the total time elapsed. The

rows correspond to the points in time. The variables (columns) correspond to the

tasks.

Eckstein et al [1995] cite other dense applications such as dense master

problems sometimes generated by the Dantzig-Wolfe or Benders decomposition,

digital filter design, data analysis and classification, and financial planning.

8.3 Future work

A. Other kinds of bids:

a. To analyze whether using dpLP with other column choice rules such

as exterior pivoting [Eiselt and Sandblom, 1985, 1990] would decrease

overall program time. This would be an extension to our analysis of

the steepest edge method.

b. To analyze block pivoting. This is another column choice method. A

whole group of non-basic variables is chosen to enter the basis at once.

These variables correspond to a “block” of columns and pivot rows.

It would be interesting to see how this method would do in the context

of retroLP and dpLP [Padberg, 1995 pgs. 70-75].

B. Special structures. Many Linear programming problems have special

structures.

a. One example is the “stepwise block structure.” This has the

following form:

Maximize   c1x1 + c2x2 + c3x3 + c4x4 + ... + c(n-1)x(n-1) + cnxn
Subject to a11x1 + a12x2                          op b1
           a21x1 + a22x2                          op b2
                           a33x3 + a34x4          op b3
                           a43x3 + a44x4          op b4
                               ...
           a(m-1)(n-1)x(n-1) + a(m-1)nxn          op b(m-1)
               am(n-1)x(n-1) + amnxn              op bm

where op refers to any of the relations =, <= or >= and the variables can be bounded from above and below.

b. Using some dedicated processors for column generation.

It would be interesting to see how retroLP and dpLP could be specialized

for linear programs with special structures [Hadley, 1962].

C. To further analyze divvying up unequal numbers of columns to the

processors. There are three possible reasons.

a. To avoid network contention between processors. When each

processor offers its “bid” during column choice, it is possible for

multiple messages to be transmitted simultaneously if processors

finish their pivot and column choice at the same time. This might

actually slow down communication. One way around that would

be to make sure processors finish at slightly different times by

giving them unequal loads. This was discussed in Section 5.8.

b. Extensions to heterogeneous computing (processors) (load

sharing). Clearly if the processors are not similar in terms of speed



and memory we would try to even them out by giving them

different computation loads.

c. Heterogeneous communication as explained below.

D. Extensions to heterogeneous communications. These are two possible

enhancements for the case of using networks other than a simple Ethernet.

a. Load sharing to compensate for the extra delay caused by

communication from outside networks.

b. Using MPI, TCP sockets or UDP sockets with error checking.

E. Dynamic load balancing and fault tolerance. Figuring out how to deal with

varying processor availability due to congestion or failure. Looking into

schemes such as column duplication or regeneration.

F. Use of partial pricing for the steepest edge rule in the full tableau method.

Appendix A. Form of Linear Program input: LPB and MPS

In this appendix we describe:

A. The internal linear programming format, LPB, used by our programs.

B. The MPS format

Each of these is illustrated by an example.

A general linear program is of the form:

Maximize   z = cx
Subject to Ax op b
           lj ≤ xj ≤ uj,   j = 1, ..., n

where op refers to any of the relations =, <= or >=.

A is the constraint matrix. l and u are the lower and upper bounds respectively.

x is a vector of unknowns and b is a vector of the right hand side values.

lj and uj can be negatively or positively infinite. If both bounds of a particular variable are infinite, the variable is said to be “free.” If both bounds of a particular variable are the same (lj = uj), the variable is said to be “fixed.” If lj is not equal to uj and both are finite, the variable is said to be “bounded.” To be consistent with the C programming language we often denote vectors by "[ ]" and matrices by "[ ][ ]."

A.1 Preprocessing into LPB format

retroLP and dpLP both use the simplex method with bounded variables. They

require a two-dimensional array, two integers and three vectors as input: a[m+2][n+1], m, n, and the vectors nl[n], nu[n], ntype[n]. (a[ ][ ], nl[ ], nu[ ], m and n are actually passed via the structure given in Table A.1.)

The output of the simplex is in the a[ ][ ] matrix at the end. Another function

extracts that information and outputs it.

Much of the data is stored in the matrix a[ ][ ] which has m+2 rows and n+1

columns where n is the number of variables and m is the number of constraints. The

0th column holds the b vector and the 0th row holds the c (objective) vector. The

(m+1)th row stores the Phase 1 objective. Constraints in the matrix can be a mixture

of less than, greater than and equalities. Vectors nl[ ], bl[ ], nu[ ] and bu[ ] hold the

upper and lower bounds of the variables. nrange[ ] and brange[ ] are lists of flags

indicating whether a variable is currently in between its bounds, lower than a lower

bound, at its upper bound or at both bounds (for fixed variable only). The values of

nrange and brange are determined by the program and need not be input.

These data structures describing the linear programming problem are all

stored within the data structure given in Table A.1.



typedef struct
{
    char *name;         // name of problem (usually the file name) (100)
    long m;             // number of rows
    long n;             // number of columns
                        // (actually, the matrix is (m+2)x(n+1))
    long mm;            // index of the current objective row: m+1
                        // for Phase 1, 0 for Phase 2.
    double **a;         // points to the constraint matrix ((m+2)x(n+1))
    var_type *ntype;    // types of non-basic variables: fixed,
                        // lower bounded, upper bounded, both, free. (n+1)
    double *nl;         // lower bounds of non-basics (n)
    double *nu;         // upper bounds of non-basics (n)
    long *jnonbasic;    // indices of current non-basic variables
    var_range *nrange;  // non-basic values within, at, below,
                        // or above bounds.
    double *x;          // current value of non-basic variables (to
                        // implement EXPAND) (n+1)
    var_type *btype;    // types of basic variables: fixed, lower
                        // bounded, upper bounded, both, free. (n+1)
    double *bl;         // lower bounds for basic variables
    double *bu;         // upper bounds for basic variables
    long *ibasic;       // indices of current basic variables (m+1)
    var_range *brange;  // basic values within, at, below,
                        // or above bounds.
    double *b;          // current value of basic variables
} LP_state;

Table A.1 - Data structure for retroLP and dpLP

Here is an example of how an outside driver program would preprocess a

linear programming problem and fill the data structures just mentioned.

Maximize   z = 2x + 2y − 5z
Subject to 5x − 4y + 3z ≤ 4
           5x + 3y + 3z ≥ 2
           2x + 3y − 4z = 10
           2 ≤ y ≤ 10,   x, z ≥ 0

First add a slack, a surplus and an artificial variable to the constraints (this can be done implicitly).

Maximize   z = 2x + 2y − 5z
Subject to 5x − 4y + 3z + s1 = 4
           5x + 3y + 3z + s2 = 2
           2x + 3y − 4z + s3 = 10
           2 ≤ y ≤ 10,   x, z ≥ 0
           s1 ≥ 0,   s2 ≤ 0,   s3 = 0

Solve for si for all i.

Maximize   z = 2x + 2y − 5z
Subject to s1 = 4 − 5x + 4y − 3z
           s2 = 2 − 5x − 3y − 3z
           s3 = 10 − 2x − 3y + 4z
           2 ≤ y ≤ 10,   x, z ≥ 0
           s1 ≥ 0,   s2 ≤ 0,   s3 = 0

This last representation is called a “Dictionary.”

The A matrix is filled with the coefficients of the dictionary.

           0   2   2  −5
           4  −5   4  −3
A[ ][ ] =  2  −5  −3  −3
          10  −2  −3   4
           0   0   0   0

nl[ ] = 0     2    0
nu[ ] = INF   10   INF
bl[ ] = 0     −INF 0
bu[ ] = INF   0    0

nrange[ ] = L L L. Each entry is one of U, L, BOTH or FREE, where U = at its upper bound, L = at its lower bound, BOTH means it is a fixed variable, and FREE means it is a free variable (not bounded on either side).

There are 3 nonbasic variables, thus n = 3.

There are 3 basic variables (a slack, a surplus and an artificial variable) corresponding to the 3 constraints, thus m = 3.

The top (0th) row is the objective; the bottom (m+1)th row is a place for the Phase 1 objective. The first (0th) column holds the right hand side constants. The resulting matrix is m+2 by n+1.

The 0th column of a[ ][ ] must be all nonnegative.

For a minimization problem the objective must be negated.
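Putting the pieces together, a driver might fill LP_state (Table A.1) for this example roughly as follows. This is our sketch: only the fields shown are set, HUGE_VAL stands in for INF, the LP_state typedef of Table A.1 is assumed to be in scope, and allocation of the remaining fields is omitted.

#include <math.h>   /* HUGE_VAL stands in for INF */

/* Our sketch of a driver filling part of LP_state (Table A.1) for
 * the example above; remaining fields and error handling omitted. */
static double row0[] = {  0,  2,  2, -5 };  /* objective row (row 0) */
static double row1[] = {  4, -5,  4, -3 };  /* constraint 1 (s1)     */
static double row2[] = {  2, -5, -3, -3 };  /* constraint 2 (s2)     */
static double row3[] = { 10, -2, -3,  4 };  /* constraint 3 (s3)     */
static double row4[] = {  0,  0,  0,  0 };  /* Phase 1 row (m+1)     */
static double *rows[] = { row0, row1, row2, row3, row4 };

static double nl[] = { 0, 2, 0 };                 /* x, y, z lower */
static double nu[] = { HUGE_VAL, 10, HUGE_VAL };  /* x, y, z upper */

void fill_example(LP_state *lp)
{
    lp->m = 3;      /* constraints */
    lp->n = 3;      /* structural variables */
    lp->a = rows;   /* the (m+2) x (n+1) tableau */
    lp->nl = nl;
    lp->nu = nu;
}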

A.2 The MPS format

MPS is a standard format originally developed by IBM for describing linear and integer programs. More details on the MPS format can be found in Murtagh [1998]. It is the format currently supported by our programs. MPS has a fixed and a free format. The fixed format is the one used by MINOS and by our code; it is the one we will describe. Each row in the file has fields, which occupy the specific column locations given in Figure A.1.



Field 1: columns 2-3
Field 2: columns 5-12
Field 3: columns 15-22
Field 4: columns 25-36
Field 5: columns 40-47
Field 6: columns 50-61

Figure A.1 - Field positions in fixed-format MPS
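A reader of the fixed format only needs to slice each line at those column positions. For illustration (mps_field is our hypothetical helper, not a routine from retroLP):

#include <string.h>

/* Copy columns lo..hi (1-based, inclusive) of a fixed-format MPS
 * line into buf, padding short lines and trimming trailing blanks.
 * buf must have room for hi-lo+2 characters. */
static void mps_field(const char *line, int lo, int hi, char *buf)
{
    int n = hi - lo + 1, len = (int)strlen(line);
    for (int i = 0; i < n; i++)
        buf[i] = (lo - 1 + i < len) ? line[lo - 1 + i] : ' ';
    buf[n] = '\0';
    while (n > 0 && buf[n - 1] == ' ')
        buf[--n] = '\0';
}

/* Example: Field 2 occupies columns 5-12, so
 * mps_field(line, 5, 12, name) extracts a row or column name. */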

NAME, ROWS, COLUMNS, RHS, BOUNDS, RANGES and ENDATA are

keywords delimiting the different sections of the file. They all begin in column 1

of their respective rows. The row starting in column 1 with “NAME” can have an

8-character problem name in Field 2. Every row in the ROWS section has a one-

character keyword (N, E, L, G) in Field 1 followed by a row name in Field 2. The

character in Field 1 specifies the type of constraint that row will contain. There

are four possible row types. They are an objective (N), an equality (E), a less than

(L) or a greater than (G).

Each record in the COLUMNS section has a column name in Field 2. Fields 3 and 4 contain a row name-value combination: the value in Field 4 goes into the row and column given in Fields 3 and 2 respectively. Fields 5 and 6 may contain another row name-value pair, or may be left blank.

The RHS section consists of a right hand side (rhs) name in Field 2. Fields 3 and 4 contain a row name-value combination: just as in the COLUMNS section, the value in Field 4 goes into the row and rhs given in Fields 3 and 2 respectively. (The MPS format supports multiple right hand sides; our implementation allows only one.) Fields 5 and 6 may contain another row name-value pair, or may be left blank.

Every row in the BOUNDS section has a two-character keyword (UP, LO, FX, FR) in Field 1 followed by a bound set name in Field 2. (The MPS format supports multiple bound sets; our implementation allows only one.) The keyword in Field 1 specifies the type of bound placed on the variable (column name) given in Field 3. There are four possibilities: an upper bound (UP), a lower bound (LO), a fixed variable (FX) or a free variable (FR). Fields 3 and 4 contain a column name-value combination: the value in Field 4 becomes the bound for the variable named in Field 3, within the bound set named in Field 2.

As an example the following is an LP in MPS format given in a text file:

NAME          TESTPROB
ROWS
 N  COST
 E  EQ1
 E  EQ2
COLUMNS
    XONE      EQ1       1
    XTWO      EQ2       1
    XTHR      COST      -10        EQ1       -1
    XTHR      EQ2       -1
    XFOUR     COST      100        EQ1       1
    XFOUR     EQ2       1
RHS
    RHS1      EQ1       2          EQ2       3
BOUNDS
 UP BND1      XTHR      1
 UP BND1      XFOUR     1
 LO BND1      XFOUR     -10
ENDATA

There are 3 rows; the first is row "COST" which is an objective row denoted by

keyword N. The second and third are rows called "EQ1" and "EQ2" which are

equality rows denoted by keyword E. Another two keywords not in this file are G and

L for greater than and less than constraints. There are four columns with names

"XONE", "XTWO", "XTHR" and "XFOUR". On the right of the column name are

one or two row names with values indicating all the values for that column. Next the

right hand side (b vector) is given in the same way the columns were given. Finally

there are three bounds. Two upper bounds denoted by keyword UP and one lower

bound denoted by LO. There are another two types of variable bounds; FX for fixed

variable and FR for free variable. There is also another section called RANGES.

(BND1 is just the "name" given to the bound in case there is another set of bounds.

RHS1 is the same. Usually there is only one RHS and one set of bounds.) Our

implementation only looks at the first set of bounds or RHS’s if there are more than

one. All values not mentioned are assumed to be 0. The problem can be a max or min

although it is usually assumed to be min. If bounds are not given for a variable

(column) a lower bound of 0 and an upper bound of +INF are assumed.

The above MPS file corresponds to the following LP:

         XONE   XTWO   XTHR   XFOUR
COST :                 −10     100
EQ1  :     1           −1        1    = 2
EQ2  :            1    −1        1    = 3

XONE, XTWO ≥ 0,   0 ≤ XTHR ≤ 1,   −10 ≤ XFOUR ≤ 1

Further information on the MPS format can be found in [Murtagh, 1998] and at

ftp://softlib.cs.rice.edu/pub/miplib.

Appendix B. Program description: retroLP and dpLP

retroLP is a full scale implementation of the standard simplex method written

in C++ compatible C. It takes input in the MPS format and supports all the options for

linear programming implied by the format except that multiple runs are not yet

supported. That is, our implementation expects at most one set of right hand side

constants, range sets, and bounds, respectively.

Three column choice rules are supported: The classical rule of Dantzig, the

steepest edge rule, and the maximum change rule. The algorithm can be easily

extended to allow the same problems to use differing column choice rules in different

iterations.

To preserve numerical stability, our implementation uses full pivoting

reinversion. The same procedure can be used to support basis crashing and warm

restarts. We use the specification for MPS given in Murtagh and Saunders [1998].

retroLP uses the EXPAND degeneracy procedure of Gill et al [1989] to improve

numerical stability and to avoid stalling and degeneracy.

retroLP is effective for dense linear programs with moderate aspect ratio.

Such problems arise, for example, in digital filter design, image processing, curve

fitting, and pattern recognition. The program can start from any assignment of values

to the variables, within bounds or not. In particular, retroLP can be used in a hybrid

computation with an interior point method along the lines suggested by Bixby et al

[2000].

In the simplex routine there are three steps in each iteration.



1. column selection using a column choice rule: column()

2. row selection: row()

3. the pivot: pivot()

Within column() there are many different possible column choice rules, only one of which is usually used for a given run, although mixing them is possible.
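Schematically, each phase is therefore a loop of the following shape (a sketch with illustrative signatures; the real routines operate on the LP_state of Table A.1, and the full Phase 1/Phase 2 case analysis is spelled out below):

/* Illustrative prototypes; see Table A.1 for LP_state. */
int  column(LP_state *lp);          /* entering column kp, or -1 if none */
int  row(LP_state *lp, int kp);     /* pivot row ip; 0 = bound flip;     */
                                    /* -1 = no constraint violated       */
void pivot(LP_state *lp, int ip, int kp);

void simplex_phase(LP_state *lp)
{
    for (;;) {
        int kp = column(lp);        /* 1. column choice rule           */
        if (kp < 0) break;          /*    no eligible column: done     */
        int ip = row(lp, kp);       /* 2. row selection                */
        if (ip == -1) break;        /*    Phase 2: problem unbounded   */
        pivot(lp, ip, kp);          /* 3. pivot on element (ip, kp)    */
    }
}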

B.1 retroLP - the serial implementation

retroLP first preprocesses data that comes in the MPS format. This was

described in Appendix A. The main simplex routine fills up row m+1 with the Phase

1 objective. It then performs the following steps:

Phase 1

1) Do column selection. There are 3 possibilities:

a) If no column can be selected and the objective value is 0 Phase 1 is

over: do clean up and begin Phase 2.

b) If no column can be selected and the objective value is not 0 the LP is

inconsistent. Report this and stop.

c) If a column kp has been selected continue with the next step.

2) Do row selection. Select the row whose constraint is the first to be

violated. There are 2 possibilities:

a) The bound of variable kp is the first to be violated: ip is set to 0: let kp

go to its other bound.

b) Row i's constraint is the first to be violated: ip is set to i: go to step 3.

3) Do a pivot on element (ip,kp).

loop on Phase 1.

Phase 2 uses row 0 for the objective.

Phase 2

1) Do column selection.

a) If no column can be selected we are at the optimum and Phase 2 is

over.

b) If a column kp has been selected continue with the next step.

2) Do row selection. Select the row whose constraint is the first to be

violated. There are 3 possibilities:

a) The bound of variable kp is the first to be violated: ip is set to 0: let kp

go to its other bound

b) Row i's constraint is the first to be violated: ip is set to i: go to step 3.

c) No constraint is violated: ip is set to -1: the problem is unbounded.

Report this and stop.

3) Do a pivot on element (ip,kp).

loop on Phase 2.

B.2 dpLP – the parallel implementation

dpLP first preprocesses data that comes in the MPS format. This was described

in Appendix A. dpLP then divides the n columns into p groups. Each of the p

processors gets approximately n/p of the columns. Each processor stores its data in

the data structure given in Table A.1.



The main simplex routine fills up row m+1 with the Phase 1 objective for all

processors. It then performs the following steps:

Phase 1

1) Each processor does column selection on its columns; the global max is

calculated and sent to all the processors. There are 3 possibilities:

a) If no column can be selected and the objective value is 0 Phase 1 is over:

do clean up and begin Phase 2.

b) If no column can be selected and the objective value is not 0 the LP is

inconsistent. Report this (to all processors) and stop.

c) If a column kp has been selected continue with the next step.

2) The winning processor does row selection. It selects the row whose constraint

is the first to be violated. There are 2 possibilities:

a) The bound of variable kp is the first to be violated: ip is set to 0: let kp go

to its other bound. All the processors update their data.

b) Row i's constraint is the first to be violated: ip is set to i: go to step 3.

3) The pivot column of the processor with the global max (winning processor) is

broadcast to all the processors together with the pivot row. Do a pivot on

element (ip,kp). The processors do this alone using the identical copy of the

winning column.

loop on Phase 1.

In Phase 2 each processor will use its row 0 for the objective.

Phase 2

1) Each processor does column selection on its columns; the global max is

calculated and sent to all the processors

a) If no column can be selected we are at the optimum and Phase 2 is

over.

2) The winning processor does row selection. It selects the row whose

constraint is the first to be violated. There are 3 possibilities:

a) The bound of variable kp is the first to be violated: ip is set to 0: let kp

go to its other bound. All the processors update their data.

b) Row i's constraint is the first to be violated: ip is set to i: go to step 3.

c) No constraint is violated: ip is set to -1: the problem is unbounded.

Report this (to all processors) and stop.

3) The pivot column of the processor with the global max (winning

processor) is broadcast to all the processors together with the pivot row.

Do a pivot on element (ip,kp). The processors do this alone using the

identical copy of the winning column.

loop on Phase 2.

References

Bennett, K.P. and Olvi Mangasarian, "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets," Optimization Methods and Software vol. 1, 1992, pgs. 23-34.

Bixby, Robert E. and Alexander Martin, "Parallelizing the Dual Simplex Method," INFORMS Journal on Computing vol. 12 no. 1, Winter 2000, pgs. 45-56.

Bosch, Robert and Jason Smith, "Separating Hyperplanes and the Authorship of the Disputed Federalist Papers," American Mathematical Monthly, August-September 1998.

Bradley, P.S., Usama Fayyad and Olvi Mangasarian, "Mathematical Programming for Data Mining: Formulations and Challenges," INFORMS Journal on Computing vol. 11 no. 3, Summer 1999, pgs. 217-238.

Bradley, P.S. and Olvi Mangasarian, "Feature Selection via Concave Minimization and Support Vector Machines," Machine Learning: Proceedings of the Fifteenth International Conference (ICML '98), J. Shavlik editor, Morgan Kaufmann, San Francisco, California, 1998, pgs. 82-90.

Bruck, Jehoshua, Danny Dolev, Ching Ho, Rimon Orni and Ray Strong, "PCODE: An Efficient and Reliable Collective Communication Protocol for Unreliable Broadcast Domains," IEEE 9th International Parallel Processing Symposium (IPPS), April 1995, pgs. 130-139.

Chvátal, Vasek, Linear Programming, Freeman, 1983.

Comer, Douglas and David Stevens, Internetworking With TCP/IP Volume III: Client-Server Programming and Applications, BSD Socket Version, 2nd edition, Prentice Hall, 1996.

Culler, David, Richard Karp et al., "LogP: Towards a Realistic Model of Parallel Computation," ACM Symposium on Principles and Practice of Parallel Programming (PPOPP), May 1993.

D'Alessio, S., K. Murray, A. Kershenbaum and R. Schiaffino, "Category Levels in Hierarchical Text Categorization," Proceedings of the Third Conference on Empirical Methods in Natural Language Processing, June 1998.

Dongarra, Jack and Tom Dunigan, "Message-Passing Performance of Various Computers," University of Tennessee Technical Report CS-95-299, May 1996.

Dongarra, Jack and Francis Sullivan, "Guest Editors' Introduction: The Top Ten Algorithms," Computing in Science and Engineering, January/February 2000, pgs. 22-23.

Eckstein, Jonathan, I. Boduroglu, L. Polymenakos and D. Goldfarb, "Data-Parallel Implementations of Dense Simplex Methods on the Connection Machine CM-2," ORSA Journal on Computing (INFORMS) vol. 7 no. 4, Fall 1995, pgs. 402-416.

Eiselt, H.A. and C.L. Sandblom, "External Pivoting in the Simplex Algorithm," Statistica Neerlandica vol. 39 no. 4, 1985.

Eiselt, H.A. and C.L. Sandblom, "Experiments with External Pivoting," Computers & Operations Research vol. 17 no. 10, 1990, pgs. 325-332.

Forrest, John and Donald Goldfarb, "Steepest-Edge Simplex Algorithms for Linear Programming," Mathematical Programming vol. 57, 1992, pgs. 341-374.

Geist, G.A., J.A. Kohl and P.M. Papadopoulos, "PVM and MPI: A Comparison of Features," Calculateurs Paralleles vol. 8 no. 2, June 1996, pgs. 137-150.

Gill, P.E., W. Murray, M.A. Saunders and M.H. Wright, "Maintaining LU Factors of a General Sparse Matrix," Linear Algebra and its Applications vol. 88-89, 1987, pgs. 239-270.

Gill, P.E., W. Murray, M.A. Saunders and M.H. Wright, "A Practical Anti-Cycling Procedure for Linearly Constrained Optimization," Mathematical Programming vol. 45 no. 3, 1989, pgs. 437-474.

Goldfarb, D. and J.K. Reid, "A Practicable Steepest-Edge Simplex Algorithm," Mathematical Programming vol. 12, 1977, pgs. 361-371.

Goudreau, Mark, Kevin Lang, Satish Rao, Torsten Suel and Thanasis Tsantilas, "Portable and Efficient Parallel Computing Using the BSP Model," IEEE Transactions on Computers vol. 48 no. 7, 1999, pgs. 670-689.

Hadley, G., Linear Programming, Addison-Wesley, 1962.

Hall, J.A.J. and K.I.M. McKinnon, "ASYNPLEX, an Asynchronous Parallel Revised Simplex Algorithm," Technical Report MS95-050a, Department of Mathematics, University of Edinburgh, July 1997.

Hall, J.A.J. and K.I.M. McKinnon, "Update Procedures for the Parallel Revised Simplex Method," Technical Report MSR 92-13, Department of Mathematics, University of Edinburgh, September 1992.

Harris, Paula, "Pivot Selection Methods of the Devex LP Code," Mathematical Programming vol. 5, 1973, pgs. 1-28.

Karp, R. and V. Ramachandran, "A Survey of Parallel Algorithms for Shared Memory Machines," Handbook of Theoretical Computer Science (J. van Leeuwen, editor), North Holland, Amsterdam, 1990, pgs. 869-941.

Karp, Richard, A. Sahay, E. Santos and K. Schauser, "Optimal Broadcast and Summation in the LogP Model," Symposium on Parallel Algorithms and Architectures (SPAA), 1993, pgs. 142-153.

Kuhn, Harold and Richard Quandt, "An Experimental Study of the Simplex Method," Proceedings of Symposia in Applied Mathematics vol. XV, American Mathematical Society, Providence, RI, 1963.

Luby, Michael and Noam Nisan, "A Parallel Approximation Algorithm for Positive Linear Programming," Association for Computing Machinery (ACM), 1993, pgs. 448-457.

Maekawa, M., A.E. Oldehoeft and R.R. Oldehoeft, Operating Systems: Advanced Concepts, Benjamin Cummings, 1987.

Mallat, Stéphane, A Wavelet Tour of Signal Processing, 2nd ed., Academic Press, 1999, pg. 419.

Martel, Charles, "Maximum Finding on a Multiple Access Broadcast Network," Information Processing Letters vol. 52 no. 1, 1994, pgs. 7-13.

Murtagh, Bruce and M. Saunders, "MINOS 5.5 User's Guide," Technical Report SOL 83-20R, Stanford University, 1983-1998.

Nash, Stephen and Ariela Sofer, Linear and Nonlinear Programming, McGraw-Hill, 1996.

Ohio Supercomputer Center, MPI Primer / Developing with LAM, Ohio State University, 1995.

Padberg, Manfred, Linear Optimization and Extensions, Springer, 1995.

Reid, J.K., "Fortran Subroutines for Handling Sparse Linear Programming Bases," Report R8269, Atomic Energy Research Establishment, Harwell, England, 1976.

Reid, J.K., "A Sparsity-Exploiting Variant of the Bartels-Golub Decomposition for Linear Programming Bases," Mathematical Programming vol. 24, 1982, pgs. 55-69.

Snir, Marc, Steve Otto et al., MPI: The Complete Reference, MIT Press, 1996.

Steiglitz, Kenneth, T.W. Parks and J.F. Kaiser, "METEOR: A Constraint-Based FIR Filter Design Program," IEEE Transactions on Signal Processing vol. 40 no. 8, August 1992.

Stunkel, Craig and Daniel Reed, "Hypercube Implementation of the Simplex Algorithm," Association for Computing Machinery (ACM), 1988, pgs. 1473-1482.

Uma, R.N., "Theoretical and Experimental Perspectives on Hard Scheduling Problems," PhD Dissertation, Polytechnic University, July 2000.

Valiant, Leslie, "A Bridging Model for Parallel Computation," Communications of the ACM vol. 33 no. 8, 1990, pgs. 103-111.

Wolfe, Philip and Leola Cutler, "Experiments in Linear Programming," Recent Advances in Mathematical Programming, Graves and Wolfe eds., McGraw-Hill, New York, 1963.
