
Computers and Structures 73 (1999) 545-564

www.elsevier.com/locate/compstruc

Parallel simulated annealing for structural optimization


J.P.B. Leite, B.H.V. Topping
Department of Mechanical Engineering, Heriot-Watt University, Riccarton, Edinburgh, EH14 4AS, UK
Received 1 October 1996; accepted 12 June 1998

Abstract

The simulated annealing (SA) algorithm has proven to be a good technique for solving difficult combinatorial optimization problems. In engineering optimization the SA has emerged as an alternative tool to solve problems which are difficult to solve by conventional mathematical programming techniques. The algorithm's major disadvantage is that convergence on a complex system may be an extremely slow process, using much more processor time than some conventional algorithms. Consequently, simulated annealing has not been widely accepted as an optimization algorithm for engineering problems. Attempts have been made to improve the performance of the algorithm either by reducing the annealing length or by changing the generation and acceptance mechanisms. However, these faster schemes, in general, do not inherit the SA property of escaping from local minima. A more efficient way to reduce the processor time and make the SA a more attractive solution for engineering problems is to add parallelism. However, the implementation and efficiency of parallel SA models are in general problem dependent. Thus, this paper considers the evaluation of parallel schemes for engineering problems where the solution spaces may be very complex and highly constrained and where function evaluations vary from medium to high cost. In addition, this paper provides guidelines for the selection of appropriate schemes for engineering problems. An engineering problem with relatively low fitness evaluation cost and a strong time constraint was used to demonstrate the lower bounds of applicability of parallel schemes.

1. Introduction

During the past three decades, there has been a growing interest in problem solving algorithms inspired by natural systems in physics and biology. An important application for these algorithms is to solve combinatorial optimization problems, i.e. the process of finding one or more optimal or near optimal solutions that minimize an objective function associated with the solution. The best known algorithms in this class include genetic algorithms and simulated annealing. While the genetic algorithm (GA) attempts to simulate nature's genetic breeding, the simulated annealing algorithm is motivated by an analogy to the behaviour of a physical system in a heat bath. Both methods are able to process functions with arbitrary degrees of nonlinearity, discontinuity and stochasticity. They are also able to handle discrete, continuous and mixed variables. Their only disadvantage is the potentially large amount of time required to converge to a near-optimal solution.

Substantial improvements have been made in increasing the convergence speed of GAs without loss of the methods' robustness. The use of advanced genetic operators [41] reduces the processor time to a highly competitive level. Further study [63] shows that GAs may achieve high levels of parallelism and, with the use of parallel hardware, "superlinear" speedups may be obtained for engineering problems.


Attempts to speed up the simulated annealing (SA) either by performing changes to the algorithm or by the use of parallel hardware have normally resulted in problem dependent behaviour. The parallelization of the SA is not a trivial task because of the sequential nature of the algorithm. In addition, determining an efficient mechanism to generate states and selecting an effective annealing schedule require a certain amount of art.

It is unlikely that parallel SA implementations will ever achieve the maximal efficiency of the parallel hardware, as frequently happens with GAs. Nevertheless, as the applications of GAs and SAs do not necessarily overlap, parallel SAs are an efficient alternative solution procedure for some engineering problems.

2. Simulated annealing versus genetic algorithm

GAs are population-based instead of point-based as SAs are, i.e. they attempt to evolve a number of complex systems concurrently while SAs develop one and refine it. The implicit parallel properties [7,26] gained by evolving a population of points in the search space concurrently suggest that GAs have a natural mapping onto parallel architectures. Revised advanced operators [41] which substantially improve the performance of the GAs have demonstrated that they are a powerful multi-purpose optimization tool.

By contrast, SAs can be easily implemented with quite a minimal degree of coding as compared with GAs and other nonlinear optimization algorithms. This feature has made the SA very attractive for hybridizations and also for use in cascade procedures after the use of quick but less accurate optimizers.

For well-understood problems, SAs may easily be tailored to generate better than random solutions, which may be extremely valuable to speed up the optimization process. This tailoring approach is far more difficult and not recommended for GAs, since they rely on a good balance between exploration of the space and exploitation of solutions obtained by complex operators.

In constrained optimization, SAs may take advantage of their generation mechanism to reduce the time spent in function evaluations by preventing recalculation of constraints which are not attached to the variable modified. GAs require all constraints to be evaluated for each new point in the search space.

For engineering problems involving large numbers of variables, the binary representation of GA populations may require amounts of computer memory above standard hardware capabilities.

Despite some behavioural differences, simulated annealing and genetic algorithms share more features with each other than with conventional optimization techniques. The exchange of knowledge regarding these two algorithms has helped their mutual development. In fact, the SA can be better viewed as a technique that complements the GA in the scope of optimization. Together they may be able to handle all kinds of engineering problems with minimal requirements when compared with conventional techniques, which often require sensitivities and other information.

3. Canonical simulated annealing

The common deterministic optimization process fails when the design variables fall into a valley where a local optimum exists. Once this happens there is generally no way out of the valley using mathematical programming techniques. Thus, the problem is usually retried with a new starting point in the hope of finding a global optimum. Since SA accepts an increase of the objective function value with a certain probability, it is possible to escape from the valley by uphill movement and thus continue the search in a good direction. Eventually, the SA process will have a good chance of reaching a global optimum.

The basic concept of SA arises from an analogy with metallurgical annealing. Annealing denotes a physical process in which a solid in a heat bath is melted at high temperatures until all molecules of the melted solid can move freely with respect to one another, followed by cooling until thermal mobility is lost.

The perfect configuration for a crystal is the one in which all the atoms are aligned in a low level lattice. As the metal cools, atoms line up with their neighbours, but in different regions atoms may point in different directions. Whole regions of atoms must then be reversed in order to escape this state of local optimum having two unaligned crystals, as shown in Fig. 1. The energy to do this is available as the heat in the metal. At any temperature above zero atoms vibrate with an energy that depends on the temperature of the metal [54]. This energy may cause atoms to reorientate themselves with a probability that, in the canonical simulated annealing (CSA), depends on the current temperature of the system, given by the Boltzmann distribution:

    Pr(E) ∝ exp(-E / kB T)    (1)

This distribution expresses the concept that a system in thermal equilibrium at temperature T has its energy probabilistically distributed among all different energy states E.

As the temperature goes down it becomes more difficult for the system to undergo large changes.

Fig. 1. Simulated annealing: crystals forming in a cooling metal.

When the temperature approaches zero, flipping becomes impossible and the state of the atoms is frozen (Fig. 2).

In this way, if the cooling is carried out sufficiently slowly, the atoms are often able to arrange themselves in a low energy state and form a pure crystal that is completely ordered. However, if the cooling is done too quickly for the solid to reach thermal equilibrium at each temperature value, the solid forms a polycrystalline or metastable amorphous state rather than a low energy lattice. In this case reheating and cooling are required to form a pure crystal.

In other words, the system sometimes goes uphill as well as downhill; but the lower the temperature the less likely is the possibility of any significant uphill excursion. The algorithm may be formulated by commencing at an initial solution or state that is repeatedly improved by making small local random alterations to yield a better solution. A pseudo-C procedure is presented as follows:

    T = T0;
    while (T > Tfreezing) {
        do until (equilibrium is reached) {
            Alternation (State i -> State j);
            if (DEij < 0) then
                accept update (State j);
            else
                r = random number [0, 1);
                if (exp(-DEij/T) > r) then
                    accept update (State j);
                else
                    refuse update (State i);
        }
        Tn = Tn * Tf;
    }

where DEij = Ej - Ei and exp(-DEij/T), known as a Boltzmann factor, is the probability of acceptance of a deterioration (i.e. a worse state); T0 is the initial temperature, Tn is the temperature at time n, Ei is the energy value for the current state, Ej is the energy value for a candidate state, and Tf is the reduction factor for the temperature, which is a function of the time n and is given by:

    Tf = 1 / log(1 + n)    (2)
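To make the procedure concrete, the pseudo-code above can be fleshed out into the following self-contained C sketch. The energy function, the neighbourhood move, the temperature values and the fixed geometric reduction factor are illustrative choices for a one-dimensional multimodal test function, not those used later in this paper:

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative multimodal energy: many local minima, global minimum at x = 0. */
    static double energy(double x) {
        return x * x - 10.0 * cos(2.0 * 3.14159265358979 * x) + 10.0;
    }

    int main(void) {
        srand(12345u);
        double x = 40.0;                      /* initial state               */
        double E = energy(x);
        double T = 100.0;                     /* T0                          */
        const double Tfreezing = 1.0e-4;
        const double a = 0.97;                /* geometric reduction factor  */
        while (T > Tfreezing) {
            for (int k = 0; k < 200; ++k) {   /* crude "equilibrium" loop    */
                double xj = x + 2.0 * ((double)rand() / RAND_MAX - 0.5);
                double Ej = energy(xj);
                double dE = Ej - E;           /* DEij                        */
                double r  = (double)rand() / ((double)RAND_MAX + 1.0);
                if (dE < 0.0 || exp(-dE / T) > r) {  /* Metropolis criterion */
                    x = xj;                   /* accept update (State j)     */
                    E = Ej;
                }                             /* else: refuse the update     */
            }
            T *= a;                           /* cooling                     */
        }
        printf("final state x = %g, energy = %g\n", x, E);
        return 0;
    }

At high T almost every move is accepted and the search wanders freely across the local minima; as T falls, only improving moves survive the Metropolis test and the state freezes.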
Fig. 2. The Boltzmann probability as a function of the temperature.

4. Convergence speed versus accuracy

Theoretically speaking, when Tn is slowly reduced by a strict cooling schedule with Tn -> 0 as n -> infinity, the final state will be the global optimal solution with probability 1. In practice, it is difficult to determine the length of the Metropolis [50] sampling process at each value of Tn and how slowly Tn should decrease. The theoretical formula (2), for which the ergodic property of the canonical simulated annealing has been proved, is far too slow to be used in practice. Instead, some fast cooling schedules are usually used, for which reason the final solution is probably worse than some solutions encountered earlier. This also explains the fact that sometimes the solution obtained by the SA is even worse than the solutions obtained by some heuristic methods. Nevertheless, improvements in the generation mechanism and/or the cooling schedule may result in more efficient implementations without deterioration of the quality of the final solution. An interesting approach proposed by Szu and Hartley [59,60] is known as fast simulated annealing (FSA). FSA uses a Cauchy probability density to generate a random process which allows large steps more often than the Gaussian probability used by the CSA. This approach leads to an inverse linear cooling rate rather than an inverse logarithmic cooling rate.

A second methodical improvement in SA was proposed by Greene and Supowit [24]. These authors presented a rejectionless mechanism to generate new solutions with a probability proportional to the effect of the transition on the cost function. Thus a subsequent solution is chosen from the neighbourhood of a current solution and no rejection occurs.

Other attempts to speed up the SA simply replaced expression (2) in order to obtain faster cooling schedules. The exponential cooling schedules (3), or simulated quenching (SQ), as they are known, are the most common in practice:

    Tn = T0 a^n    (3)

Some authors [30] claim that their SA-like algorithms justify an exponential cooling when a particular distribution is used for generating trials. Catoni [13] proposed a suitably adjusted cooling schedule which is a function of a finite amount N of available computing time, where an is given by:

    an = (c log N)^(1/2)    (4)

However, in general, a is arbitrarily specified, with commonly used values between 0.95 and 0.99 as recommended in Laarhoven and Aarts [37]. There is also the possibility of making dynamic selections of Tn according to the progressively discovered features of the energy landscape. These adaptive cooling schedules have been presented in a number of studies [37,38] with appropriate heuristics.

Kiselyov et al. [35] proposed an SA-like algorithm which attempts to improve the quality of the solution rather than reduce the annealing time. This algorithm differs from the traditional ones in that first a random fluctuation of the feasible costs of the objective function is calculated and then a few random rearrangements in the state are executed. Hence, only those rearrangements for which the cost does not exceed this random fluctuation are accepted.

Most of these authors claim that their cooling schedules are more efficient than the existing ones. On the other hand, some authors [2] affirm that cooling schedules, in general, cannot substantially improve the algorithm's efficiency. Whether these faster or "improved" algorithms can accomplish their promise is questionable, since they have only been tested on very small sets of problems. Nevertheless, the methods reported in this section present very attractive features for parallelization.

Finally, more effective improvements in the convergence speed or in the accuracy of the solution of the SA rely either on a priori knowledge of the behaviour of the energy landscape or on combinations with other methods. However, these modifications are strongly problem dependent.
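For illustration, the sketch below tabulates the temperatures produced by the logarithmic law to which Eq. (2) corresponds, here taken in the familiar closed form Tn = T0/log(1 + n), against the simulated quenching of Eq. (3); T0 = 100 and a = 0.95 are arbitrary values chosen only to show how much faster the exponential schedule decays:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double T0 = 100.0, a = 0.95;
        for (int n = 10; n <= 50; n += 10) {
            double T_log = T0 / log(1.0 + n);   /* logarithmic law, cf. Eq. (2) */
            double T_exp = T0 * pow(a, n);      /* simulated quenching, Eq. (3) */
            printf("n = %2d   T_log = %8.3f   T_exp = %8.3f\n", n, T_log, T_exp);
        }
        return 0;
    }

After 50 temperature steps the logarithmic schedule is still above 25 while the quenched one has dropped below 8, which is precisely why SQ is fast but loses the convergence guarantee.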
5. Simulated annealing in structural optimization

Reports of attempts to use the simulated annealing algorithm in structural engineering first appeared in the late 1980s. In 1988, Elperin [18] described the basic ideas of Monte Carlo annealing algorithms for structural optimization using discrete design parameters. In the same year, Salama et al. [56] presented an application of SA to the optimal placement of actuators and/or sensors for large flexible space structures. The shape control of large flexible space structures is of great interest to structural designers, because inaccuracies in the length of members and the dimensions of joints of such structures may produce unacceptable levels of surface distortion and internal forces. After Salama's work, the SA has been employed in at least three other studies in shape control or accuracy by optimal actuators in large truss-type structures. Each of them adopted different optimization criteria and annealing approaches. In order to accelerate the overall convergence, Chen et al. (1991) [14] used the best solution as the starting point every time that the temperature was reduced. The results showed that the SA provides a computationally efficient tool to find near optimal solutions for otherwise computationally prohibitive problems. In addition, it was reported that in multiple SA runs a number of near optimal solutions of the same quality, but corresponding to alternative designs, were identified. Onada [53] used the improved simulated annealing (ISA) [36] which is based on the GA. The ISA showed higher performance when compared with conventional methods, such as worst-out best-in (WOBI) and exhaustive single point substitution (ESPS). In the third study, Hamernik [28] employed an adaptive cooling rate which is a function of the energy of the current state and the energy and temperature associated with the last accepted state. The results showed that, although SA selects suboptimal locations, it yields very near optimal solutions at a fraction of the numerical cost of a fully combinatorial search.

Procedures other than active methods have been proposed to reduce the problem of shape inaccuracies using SA. Kincaid [32] formulated a discrete optimization problem to minimize distortion in trusses and compared SA with tabu search [22,23]. The same author [33] used the SA in single and biobjective minimization of surface distortion and internal forces to reduce the need for active controls. The minimal solutions obtained in many SA runs are, on average, 64.5% better than the ones obtained by the best of two other heuristic methods for member exchange [25] in terms of the value of the objective function. However, the processor time used in the SA runs was on average about 38 times longer than for the other heuristic methods.

Another problem area where SA has found large applicability is in the optimum design of skeleton structures. Balling [4] applied SA to the discrete optimization of three-dimensional steel frames. Balling

used a suitably adjusted cooling schedule and the normalizing Boltzmann constant k as the running average of E for the designs which are feasible. The SA was compared with the linearized branch-and-bound (LB&B) strategy. The SA proved to be more reliable and was able to obtain lighter structures than the LB&B algorithm. The LB&B algorithm is very sensitive to the neighbourhood size. However, with a suitable neighbourhood size it converged nearly five times faster than the SA. Balling also investigated the possibility of using approximate structural analysis in order to reduce the large time spent in function evaluations. Although the SA was able to obtain lighter solutions in a reduced time, in this case the exact analysis proved these solutions to be infeasible.

Lee and Lee [40] applied SA to the minimum weight design of trusses with discrete variables. Hwa and Soon (1992) followed the same path described in Ref. [4] and compared the SA with two traditional deterministic methods: the gradient projection method (GPM) [29] and the simplex method [52]. Again the SA proved to be more reliable in locating near optimal solutions, but slower than the two other methods. In addition, an approach using design sensitivity analysis for all design increments and employing exact analysis to reduce the error accumulated by the approximate analysis was proposed. When using approximate analysis, the same quality of solution as with exact analysis was obtained in about 40.0% of the time. However, the truss structure used in the numerical example was too small to permit sound general conclusions. More conclusive results were obtained by May and Balling [49], performing parameter sensitivity analysis of three-dimensional steel frames using gradient information to prevent the analysis of potentially poor designs. This filtered SA achieved a substantial improvement in the convergence speed with only a slight loss of robustness. Later, Siarry et al. [57] showed that by using the gradient search direction as a local optimization step for the SA, it is assured that the solution found is the best in its neighbourhood.

Topping et al. [62] developed an adaptive search strategy for bi-criteria optimization of planar truss structures with discrete and continuous variables using the SA method. A number of heuristics were introduced in the generation mechanism to perform changes in the topology while the cross-sectional areas were minimized. This generation mechanism was designed to enable the algorithm to remove redundant members while preventing mechanisms being formed in the structural behaviour. For small structures where the potential topological solutions are few, the SA may be computationally expensive compared with traditional methods. However, for structures with a large number of topological possibilities, an exhaustive search could be computationally impracticable. In such cases the SA proved to be a viable solution with some attractive features.

Recently, encouraging results which demonstrate the robustness of the SA have been obtained by Bennage and Dhingra [6], applying the method to single and multi-objective optimal design of truss structures with discrete-continuous variables. To model the multiple objective functions in the problem formulation, a cooperative game theory approach was used. To obtain such results a cooling schedule similar to that reported in Ref. [4] was used. The difference in this case is that, instead of a fixed length for the Metropolis sampling cycle, an adjustable length which increases as the temperature is reduced was adopted. The authors announced promising solutions to reduce the computational burden for structural problems by using SA-gradient-based hybrid search schemes.

Lombardi [45] applied the SA to the optimization of composite plates of 16 ply laminates considering the buckling load using continuous-discrete design variables. In this relatively small problem it was observed that a large part of the design space was explored by the SA algorithm in order to find the optimum design. Later, Lombardi et al. [46] employed the same approach to solve problems with a greater number of plies (32 and 64). For these problems a much smaller fraction of the design space had to be investigated in order to find optimum designs: an average of 11.0% for the 32 ply laminate and 0.0003% for the 64 ply laminate.

The application of SA has also been investigated in the problem of sparse matrix reduction, which has significant relevance to structural designers, since the analysis of structural systems is greatly dependent on the solution of sparse matrix equations, where the speed depends on the ordering of the unknowns. Armstrong [3] used an SA to obtain profiles and wavefronts as a benchmark to test nodal resequencing algorithms, once the solutions obtained by the SA proved to be the minimal obtained. The temperature decrement in Armstrong's SA was controlled by a variable whose formulation was defined by analogy to the statistical physics quantity known as specific heat. This SA, however, proved to be far too slow for general use. Lewis [43] used the CSA for profile and fill reduction of sparse matrices. In this case, new minimal profiles for the matrices tested by Armstrong were obtained, simply by varying the CSA control parameters. The drastic reductions obtained in the profiles validate the use of SA in some problems where the same matrix is used repeatedly, with the computational cost being amortized over a large number of solutions of equations. A more effective and innovative approach towards the general use of SAs for the problem of matrix reduction was presented by Dubuc [17], who used a very fast traditional algorithm, the Cuthill-

McKee (CM) [15], to generate the initial state for the SA runs. Special cooling schemes and generation mechanisms which take advantage of the characteristics of the problem were also proposed. The CM-SA in cascade obtained bandwidths smaller than the ones obtained by all traditional methods and in affordable time. More details about Dubuc's algorithm are presented later in this paper.

6. Parallel simulated annealing

As was explained in the previous section, the main obstacle to the general application of the SA in structural engineering optimization is its high consumption of computational time. Parallel processing appears to be the only viable way to substantially speed up the method and thus expand its applicability. For fast tailored SAs, parallel implementations may also reduce the loss of robustness.

The best parallel scheme is still the object of current research, since the "annealing community" has so far not achieved a common agreement with regard to a general approach for the serial SA. However, the performance of parallel SA schemes for structural engineering optimization is likely to be problem dependent, as is the case with the serial ones. In addition, the choice of parallel schemes largely depends on the architectural capabilities of the parallel hardware available. Nevertheless, the exhaustive investigation of possible designs for tailored parallel SA schemes would require knowledge of each individual problem and is not in the scope of the present study. Instead, this paper investigates general SA models according to their capabilities and requirements for adaptation to structural engineering problems, so that guidelines can be produced to help structural designers make a suitable choice for their specific problems.

However, before discussing different parallel SA models it is necessary to understand the difficulties in the parallel implementation of the CSA. Traditional parallelization consists of dividing the work over a number of processors using distribution of data and/or distribution of tasks. In optimization it can be obtained by breaking up the problem and/or the algorithm itself. Parallel SAs using distribution of the problem need to identify, statically or dynamically, either a number of operations or a number of changes in such a way that they do not interact with the evaluation of the function. Since in structural optimization problems most time is spent on evaluations of the functions, this approach could be very effective. However, this source of parallelization depends exclusively on the problem and will not be discussed in this paper. Examples of the application of this approach to cell-placement have been presented by Darema et al. [16] and Casoto et al. [11].

The first constraint in the distribution of the algorithm is imposed by the sequential nature of the method. The canonical simulated annealing solves the optimization problem by successive transitions which are a combined action resulting in the transformation of the current solution into a subsequent one. This action involves two main steps which are sequentially performed: the generation of a state candidate for the subsequent solution and the application of the acceptance criterion.

Parallel transitions described herein refer to those which require global information. The generation mechanism consists of performing small changes in the state and it is therefore possible to implement it without substantially modifying the optimization process. Thereafter, for transitions using multiple state changes, the accurate application of the acceptance criteria would require the evaluation of the objective function considering the combined effect of multiple changes. However, such an approach is unlikely to take full advantage of the parallel processing capabilities. Alternatively, concurrent changes may be accounted for using approximations in the acceptance criterion, but this will affect the behaviour of the algorithm considerably. In addition, a mechanism to correct cumulative errors and a rigorous control to assure convergence are required. The parallel transitions may adopt approximate and/or accurate acceptance criteria accounting for single or multiple changes. Global information may be used for the assessment of one solution to be broadcast to all processors (single transition) or for the assessment of different solutions to be updated in different processors (multiple transitions).

The parallel architectures that are currently available are usually classified, according to Flynn [20], into SIMD (single instruction, multiple data) and MIMD (multiple instruction, multiple data) machines. Both these architectures may be further subdivided into shared-memory machines (SMM) and distributed memory machines (DMM), as suggested by Gurd [27]. For convenience, this paper will refer to all parallel architectures using Gurd's classification. In DMM, global information is always obtained through interprocessor communication. The frequent utilization of global information introduces high overheads produced by the large number of interprocessor communications and by synchronization barriers. Synchronization is used here in the sense that parallel events are sequentially evaluated by a centralized control operation, thus requiring predetermination of the length of time required by each event or set of events. The use of synchronization operations indicates that, after accomplishing their tasks, some processors will be idle for a period waiting for the definition of new tasks from the

central control of the program. The programming problem, then, is to minimize the number of synchronization and communication phases, while the architectural problem is to minimize their durations. From the programming point of view, in order to reduce the number of communications, parallel SA schemes may use either only serial transitions, only parallel transitions, or alternatively both of them. Serial transitions are the ones that use only information from one processor to perform a transition, while parallel transitions use information from more than one processor. Serial transitions are the usual transitions associated with simulated annealing. Thus, parallel SA schemes may have communication after each change (only parallel transitions); communication after a number of changes (serial and parallel transitions); or no communication during the optimization process (only serial transitions).

In addition, parallel schemes using only parallel transitions will, in general, require strong synchronization at every transition. On the contrary, schemes using only serial transitions may not require synchronization at all. Thus, the designer only has to decide whether it is more advantageous to use a synchronous or an asynchronous regime for the schemes which have parallel transitions. In an asynchronous regime, state transitions are evaluated simultaneously and independently. Units continuously generate state transitions and accept or reject them on the basis of information that is not necessarily up to date. In a synchronous regime, sets of state transitions are scheduled in successive trials. After each trial, the accepted state transition is communicated through the network.

Finally, as the length of the communication path is proportional to the diameter of the network of processors, from the architectural viewpoint the designer has to determine the topology of the network to be adopted and the maximum number of processors to use. Such a decision has to take into account the optimal balance between computation and communication times.

6.1. SA models with only serial transitions

The simplest way, and the only possible way, to parallelize the SA in a DMM without requiring processor communication and synchronization throughout the optimization process is by placing copies of a serial SA on a number of processors (Fig. 3). The only communication between processes is to distribute the initial control parameters to all nodes and gather the final results to report to the user. Although this technique, known as independent and identical processing (IIP), allows an implementation rigorously faithful to the serial algorithm, it is unlikely to produce significant speedup. The motivation behind this model comes from the stochastic nature of the method, which generally requires a number of runs to give reliability to the results. Thus, the computational time of n serial runs may be cut to the time of the slowest of the n processors.

Fig. 3. Parallel SA scheme using independent and identical processing.

SAs, in general, find relatively good solutions in quite a short time but spend much time refining them. Using IIP, it is expected that the global search on different processors follows a different path, since they start from different positions in the sample (different random seeds). Therefore, improvements may be made in the convergence speed by using fast SAs which are based on shorter global optimization periods without loss of robustness.

At high temperatures (global optimization) the algorithm has high percentages of accepted moves, i.e. changes in the state, between decrements of the temperature. As the temperature reduces, only moves which improve the solution tend to be accepted. Those percentages gradually drop until they stabilize at values equal to or around zero towards the end of the process. Thus, it is also possible to reduce the period of global optimization in each processor using adaptive global controls based on the accepted and/or rejected moves from all processors. This approach is particularly suitable for SMM and is restricted to a small number of processors. Otherwise, a large number of communications would be required to update the values of the global control parameters.
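A minimal sketch of the IIP scheme is given below, assuming a standard MPI message-passing environment; run_serial_sa() is a hypothetical placeholder standing for the serial algorithm of Section 3. Each rank anneals independently from its own seed, and the only communication is the final reduction that reports the best of the n runs:

    #include <mpi.h>
    #include <stdio.h>

    double run_serial_sa(unsigned seed);   /* placeholder for the serial SA */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        struct { double energy; int rank; } local, best;
        local.energy = run_serial_sa(1234u + (unsigned)rank);  /* distinct seeds */
        local.rank   = rank;

        /* The only interprocessor communication: pick the best of the n runs. */
        MPI_Reduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MINLOC, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("best energy %g found on rank %d of %d\n",
                   best.energy, best.rank, size);
        MPI_Finalize();
        return 0;
    }

Because the ranks never synchronize during annealing, the scheme is trivially load balanced but, as noted above, offers no speedup of an individual run.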
6.2. SA models with only parallel transitions

In an interdependent optimization process, once different changes are concurrently performed on different processors and the effects on the energy state are locally evaluated, the question is "when and how should global information be used in order to speed up the process?"

This section discusses parallel SA models which use global information every time the state is changed. In this case, only parallel transitions will occur. The main difference within this class of parallel SAs lies in the

way in which global information is used in order to apply the acceptance criterion.

The application of the acceptance criterion requires the energy values of two subsequent states or configurations. When changes and evaluations are performed independently and locally in different processors, values of the energy of global states are not available and local states have to be used instead. However, as in the SA many changes are generally performed and rejected for each transition, parallelism may be employed to generate a population of states which is used to exploit good solutions before applying the acceptance criterion. Global information is required just to select the best of the local states to be used by the acceptance criterion and to update the solution.

The implementation of these schemes is based on a master task which controls the process and a number of slave tasks which generate new states and evaluate them (Fig. 4). A series of implementations using this approach has been presented by Lee and Stiles [39]. In such implementations the processors may operate in a permanent regime (All_On) or they may be turned on as the optimization process proceeds (Turn_On), as is shown in Fig. 5. As the processors communicate after every move, the Turn_On implementation attempts to reduce the number of communications and redundant work during the beginning of the process, when the percentage of accepted moves is still high.

These schemes have quite a simple implementation but a very low degree of parallelism. They are suitable for SMM, or for DMM with very few processors, because these schemes are among the ones which produce the largest number of communications. They preserve the convergence properties of the method by using only accurate evaluations of the energy function. However, they violate the serial decision sequence of the SAs because they do not consider the combined effect of changes. As shown in Fig. 4, the problem, from the architectural viewpoint, is the bottleneck in the message flow caused by the need for a task controller.

Fig. 4. Parallel SA scheme evaluating a population for a single transition.

Fig. 5. Regime of operation of the processors. (a) Scheme All_On. (b) Scheme Turn_On.

Fig. 6. Concurrent serial SAs communicating every accepted move.

Fig. 6 shows a similar approach, which is known as a clustering algorithm (CA). In this approach the current state is updated in all processors every time that one solution is accepted in any of the processors. "Clustering" refers to the action of combining two or more processors to generate a single Metropolis sample (MS) simultaneously. In order to prevent erroneously calculated transitions in SMM, a mechanism is introduced to communicate to all processors that a modification has been accepted and that they must abort their processes before replacing the current solution by the accepted one. Despite the fact that the CA produces a larger number of communications than the DA during the period of high temperatures, it is

normally more efficient, since it does not require strong synchronization.
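The abort-and-update mechanism of the CA can be sketched as follows under the same assumed MPI setting as before; the message tag, the buffer layout (energy followed by the state vector) and the helper names are illustrative. A worker that accepts a move sends the new state to every other rank, and between trial moves each worker polls for states accepted elsewhere:

    #include <mpi.h>

    /* Non-blocking poll for a state accepted on another processor.
       Returns 1 and fills buf when one has arrived; the caller then
       abandons its current trial and adopts the received state. */
    int poll_remote_accept(double *buf, int len) {
        int flag = 0;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &flag, &st);
        if (flag)
            MPI_Recv(buf, len, MPI_DOUBLE, st.MPI_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        return flag;
    }

    /* Announce a locally accepted state to all other ranks. */
    void announce_accept(double *buf, int len, int my_rank, int size) {
        for (int dest = 0; dest < size; ++dest)
            if (dest != my_rank)
                MPI_Send(buf, len, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }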
Another form of exploiting parallelism is by using the concurrency technique known as speculative computation [10]. Along with the SAs using IIP, the SAs using speculative computation are the only algorithms which faithfully reproduce serial models in DMM. Speculative computing is also the only way of performing parallel transitions without violating the serial decision sequence of the SAs. The concept is to perform a task before knowing whether it is required or not. If the work is eventually required, it is already done. Otherwise, the work is wasted. Parallel SA schemes using speculative computation can be implemented in tree-like architectural topologies where each node evaluates states incorporating the changes of all previous nodes. A balanced tree will contain all possible solutions within n sequential changes, with n being the number of levels in the tree (Fig. 7). However, only the solution corresponding to the correct speculation will return to the master processor, avoiding in this way bottleneck problems.

In highly constrained engineering problems the percentage of accepted changes in the total attempted is generally very low. In such a case, unbalanced trees are likely to be more efficient. The design of optimal trees will then depend on the expected percentage of accepted changes out of the total attempted in a serial run. Fig. 8 shows some examples of unbalanced trees which may be used to obtain more efficient systems. The first tree exploits a probability of only one change being accepted in five attempted, while the other assumes a very low acceptance probability during the whole process. In some parallel hardware, which allows dynamic configuration of the topology, different trees may be used in a single parallel run to match the demand of the different phases of the process. The schemes using speculative computation may obtain high speedups while preserving all important features of the serial SA models.

Fig. 7. An example tree with 15 processors for speculative computation.

Fig. 8. Examples of unbalanced trees with 15 processors for speculative computation.
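The idea can be illustrated on the smallest possible tree, a root move with its two children, using shared-memory threads in place of a processor network; energy() and perturb() are problem-specific placeholders and the fixed-size buffers are a simplification. While the outcome of the root move is still unknown, the candidate that would follow each outcome is evaluated concurrently; whichever branch matches the Metropolis decision is already done, and exactly one speculative evaluation is wasted:

    #include <math.h>
    #include <stdlib.h>
    #include <string.h>
    #include <omp.h>

    #define MAXN 64                      /* sketch assumes small state vectors */

    extern double energy(const double *x, int n);             /* placeholder */
    extern void   perturb(const double *x, double *y, int n); /* placeholder */

    static int accept(double dE, double T) {
        return dE < 0.0 || exp(-dE / T) > rand() / (double)RAND_MAX;
    }

    /* One speculative step on a three-node tree (root + two children). */
    void speculative_step(double *x, double *Ex, int n, double T) {
        double y1[MAXN], y2a[MAXN], y2b[MAXN], E1, E2a, E2b;
        perturb(x,  y1,  n);             /* root move                          */
        perturb(y1, y2a, n);             /* follow-up if the root is accepted  */
        perturb(x,  y2b, n);             /* follow-up if the root is rejected  */
        #pragma omp parallel sections    /* one tree node per processor        */
        {
            #pragma omp section
            E1  = energy(y1, n);
            #pragma omp section
            E2a = energy(y2a, n);
            #pragma omp section
            E2b = energy(y2b, n);
        }
        /* Replay the serial decision sequence with the precomputed energies. */
        if (accept(E1 - *Ex, T)) {
            memcpy(x, y1, n * sizeof *x); *Ex = E1;
            if (accept(E2a - E1, T)) { memcpy(x, y2a, n * sizeof *x); *Ex = E2a; }
        } else if (accept(E2b - *Ex, T)) {
            memcpy(x, y2b, n * sizeof *x); *Ex = E2b;
        }
    }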
All previously discussed schemes adopt an accurate acceptance criterion with regard to the theoretical assumptions for the convergence of the method. In such schemes, parallel transitions, in spite of searching in different neighbourhoods, always start from a single point (single transitions). An alternative approach would be transitions starting from different points and moving to different points (multiple transitions). However, in this case, transitions do not have a totally independent direction, as happens in the IIP, since they use global information to update the solutions. In applying the acceptance criterion they attempt to approximate a value for the energy of a global state as a function of the energy of local states. Although this approach has scope for maximum parallelism, it has some inconveniences which may substantially restrict its application to structural engineering problems.

Engineering problems using approximated assumptions are in general heavily constrained and have a highly discontinuous space, which makes it very difficult to obtain good evaluators. Therefore, histograms and statistical measures of local evaluations may have questionable significance. Heavy penalties for constraint violations may present a fitness barrier to the convergence to a solution from the direction of the infeasible region. In functions with many local minima of similar fitness, the convergence may not occur within affordable computational time.

Another source of inaccuracy in the acceptance criterion may be encountered in an approach known as the error algorithm (EA) [37]. The error algorithm may be described as an entire network of processors locally generating candidate states and using global states, which are successively updated in the common memory, as a current state. The notation "error" refers to the fact that the algorithm allows erroneously calcu-

lated transitions. There is no conflict between processors, and whenever a move is locally accepted it is immediately updated in the global memory, regardless of the moves already computed. In addition, this approach introduces chaos into the calculations of the energy states because the global state is the last updated but not necessarily better than the previous current local state. Calculations of error tolerance have to be introduced in the acceptance criterion and mechanisms to correct accumulated errors are, in general, required to allow convergence. The applicability of this last method in engineering optimization is therefore highly questionable.

For DMM, schemes using only parallel transitions tend to require a larger number of communications than other schemes. Their ratio of instructions to communications needs to be relatively high in order to produce significant speedup. However, in some cases they may still be quite efficient for engineering problems which have function evaluations requiring a large number of instructions.

6.3. SA models with serial and parallel transitions

In an attempt to reduce communication overheads, a number of SA schemes have been proposed where a number of steps using only local information are intercalated between two steps using global information, i.e. different processors concurrently run serial SAs, communicating from time to time to allow parallel transitions (Fig. 9). This category of parallel SA schemes is very popular, since they balance a substantial degree of parallelism with satisfactory accuracy in the solution.

Fig. 9. Concurrent serial SAs communicating from time to time.

The first generally applicable approach using this type of scheme is known as the division algorithm (DA) [37]. This approach uses all available processors running independent serial SAs until all systems reach equilibrium at a certain temperature, i.e. during one MS. Thus, every time that the temperature is reduced, there is a synchronization among processors and the state associated with the best solution from all processors is used to restart another MS.

A good compromise would be starting with the DA during the regime of high temperatures, when the number of accepted moves is relatively high, and dynamically switching to the CA for the regime of low temperatures.
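One DA temperature step might look as follows under the assumed MPI setting of the earlier sketches; metropolis_sample() is a placeholder for one complete MS of the serial algorithm:

    #include <mpi.h>

    double metropolis_sample(double *state, int n, double T);  /* placeholder */

    /* One temperature step of the division algorithm: every rank completes
       an independent MS, then all ranks adopt the best state found before
       the temperature is reduced and the next MS is started. */
    void da_step(double *state, int n, double *T, double alpha, int my_rank) {
        struct { double energy; int rank; } local, best;
        local.energy = metropolis_sample(state, n, *T);
        local.rank   = my_rank;
        MPI_Allreduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MINLOC,
                      MPI_COMM_WORLD);
        MPI_Bcast(state, n, MPI_DOUBLE, best.rank, MPI_COMM_WORLD);
        *T *= alpha;                     /* e.g. the SQ schedule of Eq. (3) */
    }

The single Allreduce/Bcast pair per temperature step is what keeps the DA's communication cost low compared with the per-move exchanges of the CA.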
7. Massively parallel simulated annealing

There are few examples of massively parallel implementations in the literature [8,11,21,65], applied either to placement problems or to shape detection. In both cases the approaches are always based on functional distribution, which is problem dependent, and on error algorithms, the suitability of which is questionable.

The other schemes proposed until now have been limited to implementations on small-scale parallel machines. Most of the parallel implementations follow the sequential algorithm strictly or with minor deviation. There are limitations in computing parallel moves within the length of an MS, due to constant overlapping in the sample. The iterative improvement behaviour inherited by the SA seems to be unsuitable for massive implementations. In a network with a very



large diameter, the gain in speed due to parallelization may not be sufficient to compensate for the high overheads in routing messages and/or synchronization.

Parallelization of the discussed SA schemes on large-scale parallel machines will be able to make use of only a small number of the available processors. The question which arises is how to make more efficient use of the available parallel hardware. The answer points towards multiple sources of parallelism, combined efficiency criteria and possibly the development of schemes based on the parallel hardware rather than on the sequential algorithm.

The use of speculative computation may be an alternative solution to increase the applicability and efficiency of large-scale machines. Recently, Sohn [58] presented performance results of a parallel SA using speculative computation to solve the TSP (travelling salesman problem). The network consisted of 100 processors arranged in an unbalanced tree topology as shown in Fig. 8a. Sohn reported that good performances were obtained using up to 10 processors. Thereafter, an increase in the number of processors was still able to speed up the optimization process, yet the efficiency of the network became very poor. The performance of this approach is also dependent on the initial temperature and the size of the problem: the higher the initial temperature, the lower the efficiency. The reason for the weak performance of the large unbalanced trees is that the maximum number of processors that can be effectively used in one transition is equal to the number of moves required to perform such a transition. In addition, the higher the temperature of the system, the smaller the number of trials between two subsequent accepted moves. The maximum speedup obtained by Sohn's SA was 20 times on 100 processors.

The current study suggests that further speedup may be obtained by combining speculative computation with other sources of parallelism. Fig. 10 shows an example of an implementation that combines two sources of parallelism: the parallelism presented by the division or clustering algorithms in generating an MS simultaneously, and the parallelism of speculative computation in employing many processors to generate a single transition. In large DMM, it may be possible to configure the network topology as a number of independent speculative trees which communicate with each other to generate the same MS simultaneously. Each of the speculative trees would then be used to speed up the transitions.

Fig. 10. Multiple speculative trees for large-scale parallel machines.

Fig. 11. Parallel SA model with overlapping neighbourhoods of five processors transferring information over the network by diffusion.

Alternative approaches may be borrowed from parallel GA models [63]. Thus, an MS may be generated by overlapping neighbourhoods of network nodes (Fig. 11). In this approach processors are only allowed to communicate with their direct neighbours and information will pass from one network region to another by means of diffusion. The neighbourhoods may use different cooling schedules, setting distinct global or local optimization roles for different neighbourhoods. As the diffusion mechanism may take a number of transitions to transfer information to the opposite network ends, the neighbourhoods are allowed to refine temporarily different solutions or minima. This approach has potential for powerful search, but it requires a study of appropriate control parameters, which are expected to be different from those of sequential algorithms. It also has limitations in the speedup that can be expected by increasing the number of processors. However, there is no limitation in the number

of processors, since an increase in the number of processors does not introduce communication overheads.

8. Parallel hybrid simulated annealing

A number of hybrid schemes using simulated annealing combined with some local optimization techniques have been presented in the literature [5,48,55]. These local optimization techniques are used to refine the solutions or to generate better than random solutions. Although they may be able to speed up the sequential algorithm, they tend to reduce the scope for parallelization.

Other hybrid approaches are based on combinations of the SA with genetic algorithms. Mahfoud and Goldberg [47] proposed an SA-GA approach which the authors called parallel recombinative simulated annealing (PRSA). Later, Lin et al. [44] proposed a similar approach called the annealing genetic algorithm. Both approaches attempt to introduce the implicit parallelism of the GA's multiple-point search into an annealing framework. This was obtained by evolving a GA population under thermodynamic control. The GA's roulette wheel selection was replaced by a Boltzmann-type selection operator or acceptance criterion within a specific cooling process. The important feature of these algorithms is that they also inherit the capability to map naturally onto parallel architectures.
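As a sketch, a Boltzmann-type selection of this kind can be realized as a roulette wheel over the weights exp(-Ei/T), so that selection pressure rises as T falls; this is one illustrative form, not necessarily the exact operator of Refs. [47,44]:

    #include <math.h>
    #include <stdlib.h>

    #define POP_MAX 256

    /* Return the index of the selected individual; E[i] is the energy
       (objective value to be minimized) of individual i. */
    int boltzmann_select(const double *E, int pop, double T) {
        double w[POP_MAX], total = 0.0;
        for (int i = 0; i < pop; ++i) { w[i] = exp(-E[i] / T); total += w[i]; }
        double r = total * ((double)rand() / ((double)RAND_MAX + 1.0));
        for (int i = 0; i < pop; ++i) { r -= w[i]; if (r <= 0.0) return i; }
        return pop - 1;   /* guard against rounding */
    }

At high T the weights are nearly uniform (random selection); as T approaches zero the operator degenerates towards picking the best individual, mirroring the annealing acceptance behaviour.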
A slight difference in this approach was introduced by Brown et al. [9] to solve the quadratic assignment problem. Brown's hybrid parallel GA uses a simulated annealing algorithm for each solution, attempting to obtain local improvements in each generation.

9. Synchronization and communication overheads

The majority of the parallel SA schemes consist of interactive models, i.e. models in which there are top-down and bottom-up network communications.

In SMM the information exchange between processors is performed by means of global memory. Minimization of communication overheads depends on the choice made between shared and private memory. The common memory is rather large, and its access time is much longer than the processor instruction execution time. In addition, overlapping in accessing memory banks decreases the memory access speed. Each processor has its own fast cache, which should preferably be used to reduce the number of accesses to the common memory. SAs make use of the common memory to store global states to be accessed by all processors. However, in order to prevent inconsistent state views among the processors, an expensive synchronous update procedure is required before each evaluation of a state change. The synchronous update procedure uses some form of timing after which a new value is determined simultaneously for all processors. Synchronization is used in the sense of locking processes in one or more network nodes until their dependencies are acquired. In SMM, synchronization is made by an arbiter which controls the execution of all processors and decides how handed-in information will be used for updating values. The long waiting time spent while processes are locked leads to a loss of efficiency. The possibilities for relaxing synchronization depend to a large extent on how much error in the convergence and in the final solution may be tolerated. In engineering problems these tolerances are generally small.
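In a shared-memory setting the consistency problem reduces to guarding the global state, as in the following pthreads fragment; the improvement-only update rule and all names are illustrative simplifications:

    #include <pthread.h>
    #include <string.h>

    #define NVARS 64
    static double          g_state[NVARS];     /* global current state */
    static double          g_energy = 1.0e30;
    static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called by a worker that has locally accepted a move; the lock
       prevents other processors from reading a half-updated state,
       at the price of the waiting times discussed above. */
    void publish_state(const double *x, double E) {
        pthread_mutex_lock(&g_lock);
        if (E < g_energy) {
            memcpy(g_state, x, sizeof g_state);
            g_energy = E;
        }
        pthread_mutex_unlock(&g_lock);
    }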
In DMM information is transferred directly from one processor to another by messages through fast links when processors are connected to each other; otherwise a routing mechanism is required to handle the message passing. Communication by message passing systems is more time consuming than common memory access. The problems of communication overheads depend on the size of the messages, on the diameter of the network of processors and on the topology selected for the network interconnection. Synchronization is forced in order to allow processors to receive global information and update current values. Alternatively, arbiters may also be used, in this case to store handed-in messages and to decide when values need to be updated, allowing in this way processors to carry out their computations asynchronously without the introduction of waiting times.

10. Numerical examples

For a great number of structural problems, particularly those using the finite element method, the mathematical model that has to be solved results in a set of sparse linear equations. Many algorithms which solve sparse linear equations use elimination techniques in combination with a reordering of the coefficient matrix to preserve sparsity. The traditional approach for problems where the coefficient matrix is symmetric or structurally symmetric is to use symmetric (row and column) permutations to reduce the bandwidth, the profile, or the wavefront of the matrix.

The results obtained using the SA have compared favourably with all traditional methods in reducing the matrix bandwidth. However, the SA has been considered far too slow for general use. The first reason for the poor performance of the SA is that it uses a larger solution space than traditional methods. SA reduces matrix bandwidth by randomly selecting two coefficients and exchanging their rows

and columns in the matrix. The number of allowable moves nam, for a square matrix of order N, is given by:

    nam = A(A - 1)/2,   A = (N - 1)(N - 2)    (5)

However, a move may only result in a reduced bandwidth if it involves one of the noc offensive coefficients which are increasing the bandwidth. Thus, there is an upper bound num for the moves which may effectively reduce the matrix bandwidth, given by:

    num = noc(A - 1),   noc < N    (6)

For medium to large matrices nam is much bigger than num and a blind search results in an extensive search in ineffective regions. Dubuc [17] proposed a tailored but very efficient serial SA algorithm for bandwidth reduction which uses this a priori knowledge. This author developed a mechanism to identify possibly effective moves, incorporated in the SA generation mechanism. This efficient generation mechanism combined with adaptive cooling schedules results in faster and more effective algorithms, which are used in this paper as serial models for parallelization. In the current work two modifications were made to Dubuc's original serial algorithm in order to make it consistent with parallel versions. The way in which random numbers are generated in the original algorithm is machine dependent and had to be modified. The second modification was the introduction of a stop criterion in terms of the ratio between the number of accepted moves and the total number of attempted moves. This last modification was introduced to take into account the contribution of many processors to generating an MS in the parallel implementation. The final results in most of the cases were identical to the ones obtained using the original algorithm.
into account the contribution of many processors to along the diagonals searching for the ®rst non-zero
generate a MS in the parallel implementation. The coecient ar,c. The current bandwidth is given by
®nal results in most of the cases were identical to the (r ÿ c ). The second coecient for the permutation is
ones obtained using the original algorithm. selected at random. Thereafter, a number of heuristics
Everstine's collection [19], consisting of 30 structural are used to determine whether the permutation would
engineering matrices, is used to investigate the per- increase the bandwidth or not. Thus, if the bandwidth
formance of the parallel implementation. These increases, the new o€ensive is scanned using a bounded
matrices were collected from US military and NASA search in the direction of the upper-left corner. The
users of the structural engineering package move is accepted or rejected according to a certain
NASTRAN [51] for use as a benchmark test for band- probability. However, if the bandwidth does not
width reordering algorithms. Table 1 presents some in- increase, the move is automatically accepted and the
formation about the patterns and minimal bandwidths new non-zero coecient is searched for along the diag-
of these matrices. In the ®rst two columns in Table 1, onals from the position where it has previously
borrowing the jargon from the graph theory, are the stopped. There may be a number of non-zero coe-
order N of the matrices and the maximum degree, i.e. cients which are causing the same maximum band-
the maximum number of connectivities of a generic width. In order to prevent bias in the choice of one of
node. As the length of an MS and the stop criteria are these coecients, the diagonals may be considered as
functions of the order of the graph corresponding to rings, as shown in Fig. 12. Thus, after transversing a
the matrix of dimension N  N, the execution time is diagonal, the algorithm moves to a random position in
highly dependent on the value of N. In addition, for the next diagonal and travels along it forward or back-
matrices with high degree of connectivity the band- ward with equal probability.
widths tend to reduce by very small decrements with a This improvement in the generation mechanism, pro-
higher number of accepted moves. The next two col- posed by Dubuc, results in more ecient search and
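To make the mechanics above concrete, the following is a minimal, illustrative Python sketch of a move of this kind: Eqs. (5) and (6) are coded directly, an offensive coefficient is located by walking the outermost occupied diagonal as a ring, and a trial row-and-column exchange is kept or undone by the Metropolis rule. The function names, and the simplification of pairing the offensive entry's row with a purely random second index (the real algorithm applies further heuristics), are our assumptions, not Dubuc's code.

import math
import random

def bandwidth(a):
    """Bandwidth of a symmetric 0/1 pattern: max (r - c) over non-zeros."""
    n = len(a)
    return max((r - c for r in range(n) for c in range(r + 1) if a[r][c]),
               default=0)

def allowable_moves(n):
    """Eq. (5): n_am = A(A - 1)/2 with A = (N - 1)(N - 2)."""
    big_a = (n - 1) * (n - 2)
    return big_a * (big_a - 1) // 2

def effective_moves(n_oc, n):
    """Eq. (6): n_um = n_oc(A - 1), valid for n_oc < N."""
    big_a = (n - 1) * (n - 2)
    return n_oc * (big_a - 1)

def find_offensive(a, b):
    """Walk the diagonal r - c = b as a ring: random start, random
    direction; return the first non-zero (offensive) coefficient."""
    length = len(a) - b
    start = random.randrange(length)
    step = random.choice((-1, 1))
    for k in range(length):
        c = (start + k * step) % length   # ring position = column index
        if a[c + b][c]:
            return c + b, c
    return None

def swap(a, i, j):
    """Exchange rows i, j and columns i, j (a symmetric relabelling)."""
    a[i], a[j] = a[j], a[i]
    for row in a:
        row[i], row[j] = row[j], row[i]

def anneal_step(a, temp):
    """One SA move; returns True if the move is kept."""
    b_old = bandwidth(a)
    if b_old == 0:
        return False
    r, _ = find_offensive(a, b_old)
    j = random.randrange(len(a))          # second coefficient: random index
    swap(a, r, j)                         # trial permutation
    b_new = bandwidth(a)
    if b_new <= b_old:                    # no increase: accept automatically
        return True
    if random.random() < math.exp((b_old - b_new) / temp):
        return True                       # uphill move kept (Metropolis)
    swap(a, r, j)                         # rejected: undo the exchange
    return False

For the smallest matrix in Everstine's collection (N = 59), allowable_moves(59) gives n_am = 5,463,165 while, with for instance n_oc = 10 offensive coefficients, effective_moves(10, 59) is only 33,050; this illustrates why a blind search spends most of its effort in ineffective regions.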
Table 1
Everstine's matrix collection: patterns and bandwidths

Matrix order | Maximum degree | Symmetric entries | Fill (%) | Initial bandwidth | CM bandwidth | SA bandwidth | Description
59 | 5 | 163 | 7.67 | 25 | 8 | 6 | 2D frame
66 | 5 | 193 | 7.35 | 44 | 3 | 3 | Truss
72 | 4 | 147 | 4.28 | 12 | 7 | 5 | Grillage
87 | 12 | 314 | 7.15 | 63 | 18 | 10 | Tower
162 | 8 | 672 | 4.50 | 156 | 16 | 12 | Plate with hole
193 | 29 | 1843 | 9.38 | 62 | 44 | 32 | Knee prosthesis
198 | 11 | 795 | 3.55 | 36 | 12 | 8 | Reinforced mast
209 | 16 | 976 | 3.99 | 184 | 33 | 23 | Console
221 | 11 | 925 | 3.34 | 187 | 15 | 13 | Hull-tank region
234 | 9 | 534 | 1.52 | 48 | 25 | 11 | Tower with platform
245 | 12 | 853 | 2.43 | 115 | 40 | 21 | Carriage
307 | 8 | 415 | 0.55 | 63 | 37 | 29 | Power supply housing
310 | 10 | 1379 | 2.55 | 28 | 13 | 12 | Hull with refinement
346 | 18 | 1786 | 2.69 | 318 | 47 | 30 | Deckhouse
361 | 8 | 1657 | 2.26 | 50 | 15 | 14 | Cylinder with cap
419 | 12 | 1991 | 2.03 | 356 | 32 | 26 | Barge
492 | 10 | 1824 | 1.30 | 435 | 31 | 17 | Piston shaft
503 | 24 | 3265 | 2.38 | 452 | 54 | 41 | Baseplate
512 | 14 | 2007 | 1.34 | 73 | 32 | 21 | Submarine
592 | 14 | 2848 | 1.46 | 259 | 41 | 29 | CVA bent
607 | 13 | 2869 | 1.39 | 147 | 52 | 39 | Wankel rotor
758 | 10 | 3376 | 1.04 | 200 | 29 | 20 |
869 | 13 | 4075 | 0.96 | 586 | 41 | 31 |
878 | 9 | 4163 | 0.97 | 519 | 34 | 25 | Plate with insert
918 | 12 | 4151 | 0.88 | 839 | 50 | 32 | Beam with cutouts
992 | 17 | 8868 | 1.70 | 513 | 53 | 37 | Mirror
1005 | 26 | 4813 | 0.85 | 851 | 102 | 56 | Baseplate
1007 | 9 | 4791 | 0.85 | 986 | 34 | 28 |
1242 | 11 | 5834 | 0.68 | 936 | 92 | 56 | Sea chest
2680 | 18 | 13,853 | 0.35 | 2499 | 69 | 54 | Destroyer

This improvement in the generation mechanism, proposed by Dubuc, results in a more efficient search and substantial speedup, but it reduces the room for parallelism because a pair of coefficients can only be generated after evaluating the effect on the bandwidth caused by the previous permutation.

A second approach was also proposed by Dubuc [17]. This approach uses the traditional Cuthill-McKee algorithm, one of the fastest algorithms for bandwidth reduction, to generate initial solutions for the SA. Thereafter a quasi-quenching cooling schedule is performed in cascade. Dubuc shows that such an approach produces, in general, better solutions than other traditional algorithms in equivalent amounts of computational time.
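As an illustration of this cascade, the sketch below seeds the annealer with a Cuthill-McKee-style ordering (SciPy provides the reverse variant, which serves the same purpose here) and then runs a short, low-temperature quench using the anneal_step() routine sketched earlier. The temperatures, cooling factor and stage length are placeholders, not Dubuc's schedule.

from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def cm_sa(pattern, t_start=0.5, t_freeze=0.01, alpha=0.9, ms_len=200):
    """Cascade: a fast constructive ordering first, then a short
    'quasi-quenching' SA refinement starting from a low temperature."""
    perm = reverse_cuthill_mckee(csr_matrix(pattern), symmetric_mode=True)
    a = [[pattern[i][j] for j in perm] for i in perm]  # apply the ordering
    t = t_start                        # far below a conventional SA start
    while t > t_freeze:
        for _ in range(ms_len):        # one Markov stage (MS) per level
            anneal_step(a, t)
        t *= alpha                     # geometric cooling
    return a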
In this paper some of the parallel models described will be investigated for the parallelization of the two algorithms proposed by Dubuc. The problem of minimization of the bandwidth of sparse matrices was chosen as an example because it exhibits some of the inconveniences which often make parallelization a difficult task.

Fig. 12. Mechanism to identify an offensive coefficient in a sparse matrix.
In this type of problem the function evaluation (checking the bandwidth) is very fast and the time between communications is very short.

10.2. The hardware setup

The parallel hardware consists of a 486 PC host and a number of TRAMs (transputer modules) supported by TTM3 daughterboards. The basic component is an IMS T800 transputer running at 25 MHz. It is a 32-bit microprocessor, with 1, 2, 4 or 8 Mbytes of dynamic RAM, 4 Kbytes of on-chip fast static memory and an FPU (floating point unit), delivering 30 MIPS and 4.3 Mflops peak performance. Communication between transputers is supported by four bi-directional serial links operating at 20 Mbits/s.

The available hardware has been arranged in two independent distributed memory systems. The first setup consists of the host 486 PC, a root processor with 2 Mbytes of dynamic RAM and 11 slave processors with 1 Mbyte of RAM each. The 12 processors were connected in a pipeline topology using the four available physical links to handle message passing, as shown in Fig. 13. The first two links are used to connect a router task to the two adjacent routers in the pipeline. The router task stores the messages in an internal memory to protect them from being erroneously overwritten in the worker task. However, the limited amount of memory does not permit a large volume of data to be duplicated in the internal memories of the router and worker tasks. On the other hand, if the data are broken into small pieces to be passed in parts via small buffers, the communication process becomes more time consuming. Thus, the routers are only responsible for high-priority, short communications between the workers and an arbiter in the master task, which is responsible for selecting the processor that will update the others. The updates are communicated directly to the worker tasks through the two other processor links, from the chosen processor to the others. This first hardware setup was used to investigate the speedup due to increases in the number of processors, but the limited amount of memory does not allow execution for the large matrices.

Alternatively, a second hardware system, consisting of a PC host, an 8 Mbyte root processor and three slaves with 4 Mbytes of RAM each, was used to obtain new minimum bandwidth values for the large matrices.

10.3. The parallel implementations

A total of seven parallel implementations were used for the comparisons of the numerical examples. The first five implementations have the modified Dubuc SA as the serial model for the parallelization. For the other two parallel implementations, the SA in cascade with the Cuthill-McKee algorithm was used as the serial model.

Fig. 13. The hardware setup and task distribution in a pipeline topology.
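The router-arbiter division of labour described above can be mimicked in a few lines. The toy below is our construction, not the authors' transputer code: the router channels are modelled by queues, workers send short notices to the master's arbiter, and the arbiter nominates the first sender to broadcast its update directly to the others.

import queue
import threading

N_WORKERS = 4
notices = queue.Queue()                    # short, high-priority channel
inboxes = [queue.Queue() for _ in range(N_WORKERS)]  # direct update links
grants = [threading.Event() for _ in range(N_WORKERS)]

def worker(rank):
    notices.put(rank)                      # "I have an accepted move"
    if grants[rank].wait(timeout=1.0):     # nominated by the arbiter?
        for inbox in inboxes:              # broadcast the update directly
            inbox.put(("state-from", rank))

def arbiter():
    winner = notices.get()                 # first notice wins the round
    grants[winner].set()

threads = [threading.Thread(target=worker, args=(r,))
           for r in range(N_WORKERS)]
for t in threads:
    t.start()
arbiter()
for t in threads:
    t.join()
print("updates delivered:", [q.qsize() for q in inboxes])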
The SA's ergodic property and statistical guarantee of finding a global minimum rely on the assumption of time t -> infinity. In practice, there is a trade-off between the computational time and the quality of the solution. The direct consequence is that different solutions are often obtained from independent runs when the execution starts from different points of the sample. In addition, the runs which are faster in identifying the valley of a global minimum also tend to be faster in refining the solution. The reliability of the solution is then based on a few runs, which can be performed concurrently. Thus, the first parallel implementation investigated in this section is an example of the IIPSA as described in Section 6.1. In order to obtain additional speedup, the cooling schedule in the IIPSA adopts a slightly higher freezing point than in the serial SA, since it is expected that some processors reach the solution in a shorter time than others.

The second parallel implementation used in this section consists of slave tasks on all processors, each evaluating the effect of one move and informing the master task on the root processor of the results. The master task then selects one of the moves to be performed by all slave tasks, using two different selection criteria in alternative implementations. In the first implementation (PSA1), the state which results in a minimal bandwidth is selected. In the other implementation (PSA2) there is a random selection among the accepted moves, if more than one move is accepted, or a random selection among all attempted moves when none of the moves is accepted. The accepted move is used to obtain the same current state on all processors, and each of these then evaluates another independent move. The messages, in this case, are short in length but large in number. They consist of the minimal information required to allow the slave tasks to perform the same move concurrently. The execution time for these implementations increases with the number of processors, and the results start to deteriorate after three processors. This negative performance was expected for this problem, since the time to evaluate a change is very short while the overheads in communications and in comparisons among solutions are very high. The loss in the quality of the solution in many problems occurs because the competition between processors results in powerful local optimization but reduces the capability of the algorithm to escape local minima. The results obtained by PSA2 were, in general, better than the ones obtained by PSA1 in terms of the quality of the solution.

An approach similar to that used by the CA, described in Section 6.2, was also implemented for this small distributed memory system. In this implementation the accepted move which is first communicated to the master task is used to update the solutions on all processors. The counter of accepted moves is incremented on the master task and broadcast to the slave tasks. This value is then used to decide whether a subsequently arriving solution has taken the previous move into account or not. If the counter value in the message arriving from a slave task is equal to the one in the master task, the solution is accepted and communicated to all slaves. However, if the value in the arriving message is smaller than the one in the master task, it means that the move was accepted in the slave task before it had received the latest changes. The message is then disregarded.

This scheme produced results of much better quality than the previous two schemes. The algorithm exhibits slightly different behaviour from the serial SA: it finds the neighbourhood of the global optimum quicker than the serial algorithm but has more problems in refining the solution in the last stages. There is often a loss of one or two units in the accuracy of the final bandwidth. The frequency of the communications is gradually reduced towards the end of the process, since accepted moves become less likely to occur. The messages in this scheme are practically the same size as the messages in the two previously implemented schemes, but the number of communications is substantially reduced. This approach results in a speedup for the first few processors. However, this speedup is due rather to the new features of the parallel scheme than to the hardware acceleration. If more processors are used, the frequency of communication increases and the improvements which may be gained by carrying out the search concurrently on different parts of the sample are not sufficient to compensate for the overheads in routing messages and the bottleneck problems. A limitation of four processors was observed in order to assure a suitable MS length for the small to medium matrices. Although, for large matrices, one or two more processors may be used without loss in the quality of the solution, this is unlikely to produce further speedup since there is an increase in communication overheads due to bottleneck problems. Contrary to what normally occurs in different serial runs for the same problem, the difference between the execution times of parallel runs using different starting points and different random seeds is minimal. In addition, the same minimal or near-minimal values were obtained in five independent runs for each problem in 27 of the 30 problems. The average execution times of the five serial runs were compared with the average of another five parallel runs for each problem. The speedups obtained using this scheme for 29 problems are shown in Table 2. It was not possible to measure the speedup for the last matrix due to a limitation of memory; the minimal bandwidths for the last matrix were obtained in simulations of the parallel schemes implemented on a SUN/Sparc10 workstation.
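The stale-message rule just described boils down to comparing move counters. Below is a compact sketch of the master-side logic; the message layout and names are our assumptions, not the paper's code.

accepted_counter = 0   # master's count of accepted moves, broadcast to slaves

def broadcast(msg):
    """Stand-in for sending (counter, state) to every slave task."""
    pass

def on_slave_message(counter_seen, state):
    """Accept a slave's solution only if it was generated against the
    latest broadcast counter; otherwise it is stale and discarded."""
    global accepted_counter
    if counter_seen == accepted_counter:
        accepted_counter += 1              # a new globally accepted move
        broadcast((accepted_counter, state))
        return True
    return False                           # slave had not seen the latest move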
Table 2
Minimal bandwidth and speedup of parallel implementations

The speedup columns give the results for the Clustering A and Division A schemes on 2, 3 and 4 processors.

Matrix order | Minimum bandwidth | Clustering A: 2 | 3 | 4 | Division A: 2 | 3 | 4
59 | 6 | 1.216 | 1.233 | 1.242 | -1.101 | -1.134 | -1.182
66 | 3 | 1.182 | 1.319 | 1.412 | 1.131 | -1.112 | -1.143
72 | 5 | 1.372 | 1.491 | 1.648 | -1.119 | -1.162 | -1.195
87 | 10 | 1.100 | 1.197 | 1.208 | -1.081 | -1.097 | -1.122
162 | 11 (a) | 1.047 | 1.065 | 1.115 | -1.032 | -1.037 | -1.070
193 | 32 | 1.017 | 1.016 | 1.085 | 1.017 | -1.028 | -1.046
198 | 8 | 1.048 | 1.197 | 1.201 | -1.008 | -1.089 | -1.172
209 | 23 | 1.038 | 1.118 | 1.209 | 1.045 | 1.046 | 1.037
221 | 13 | 1.099 | 1.201 | 1.244 | 1.058 | 1.073 | 1.066
234 | 11 | 1.143 | 1.189 | 1.241 | 1.091 | 1.112 | 1.102
245 | 21 | 1.122 | 1.212 | 1.306 | 1.067 | 1.099 | 1.126
307 | 28 (a) | 1.092 | 1.167 | 1.192 | 1.045 | 1.113 | 1.140
310 | 12 | 1.123 | 1.195 | 1.271 | 1.094 | 1.141 | 1.167
346 | 29 (a) | 1.100 | 1.178 | 1.219 | 1.033 | 1.078 | 1.107
361 | 14 | 1.088 | 1.153 | 1.195 | 1.103 | 1.138 | 1.155
419 | 25 (a) | 1.163 | 1.225 | 1.290 | 1.056 | 1.097 | 1.134
492 | 17 | 1.091 | 1.156 | 1.198 | 1.111 | 1.172 | 1.181
503 | 41 | 1.113 | 1.167 | 1.200 | 1.135 | 1.162 | 1.173
512 | 21 | 1.177 | 1.256 | 1.303 | 1.094 | 1.146 | 1.187
592 | 29 | 1.114 | 1.162 | 1.193 | 1.132 | 1.159 | 1.180
607 | 38 (a) | 1.083 | 1.142 | 1.190 | 1.101 | 1.173 | 1.174
758 | 20 | 1.107 | 1.176 | 1.223 | 1.082 | 1.131 | 1.163
869 | 31 | 1.112 | 1.166 | 1.202 | 1.113 | 1.152 | 1.189
878 | 24 (a) | 1.151 | 1.203 | 1.254 | 1.092 | 1.164 | 1.128
918 | 32 | 1.108 | 1.139 | 1.182 | 1.130 | 1.179 | 1.211
992 | 35 (a) | 1.063 | 1.093 | 1.145 | 1.165 | 1.193 | 1.214
1005 | 28 | 1.107 | 1.176 | 1.236 | 1.103 | 1.144 | 1.167
1007 | 37 | 1.123 | 1.182 | 1.214 | 1.110 | 1.162 | 1.173
1242 | 56 | 1.137 | 1.177 | 1.201 | 1.101 | 1.148 | 1.181
2680 | 52 (a) | - | - | - | - | - | -

(a) New minimal bandwidths obtained in runs using the DA scheme.
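The entries of Table 2 are read as the usual ratio of average execution times; on our reading, the negative entries flag parallel runs that were slower than the serial algorithm. For example, for the first matrix (order 59) under the Clustering A scheme:

S_p = \bar{T}_{\text{serial}} / \bar{T}_{\text{parallel}}, \qquad
S_4 = 1.242 \;\Rightarrow\;
\bar{T}_{\text{parallel}} \approx \bar{T}_{\text{serial}} / 1.242
\approx 0.81\,\bar{T}_{\text{serial}},

i.e. four processors recover roughly 19% of the serial running time for this small matrix.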

The next parallel SA scheme which was implemented is an example of the DA, described in Section 6.3, for DMM. In this implementation each processor runs an entire and independent MS between two communications. The time between communications, i.e. the duration of one MS, is defined either by a specified number of accepted moves or by an upper bound on the number of attempted moves, whichever is reached first. After a slave task has communicated its results to the master task, it waits for the decision of the master task concerning the processor with the best move. This processor then broadcasts its values throughout the network, and these are used as the starting parameters for a subsequent MS. For sparse matrix problems, the time to reverse the accepted moves on the other processors and repeat the same sequence of accepted moves which was performed by the processor with the best move would be very long. Instead, the entire state array and all relevant data from this processor are passed to the other processors. Thus, this scheme also presents a reduced number of communications when compared with the algorithms PSA1 and PSA2. However, the size of the messages may be much larger, since in this case the messages contain the entire set of values corresponding to the current state. This scheme also showed some speedup, but only for the matrices of medium to high order. For small matrices, the frequency of communication is too high to allow any speedup. Even for the large matrices, this scheme was still slightly slower than the previous one, but the quality of the results is in general equal to or better than that of the serial algorithm.
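A single-process simulation of one DA round might look as follows. It reuses anneal_step() and bandwidth() from the earlier sketch; the stage limits are placeholders, and in the real implementation the replicas run on separate transputers rather than in a loop.

import copy

def run_ms(state, temp, max_accepted=50, max_attempted=500):
    """One Markov stage: stop after max_accepted accepted moves or
    max_attempted attempted moves, whichever is reached first."""
    accepted = attempted = 0
    while accepted < max_accepted and attempted < max_attempted:
        attempted += 1
        if anneal_step(state, temp):
            accepted += 1
    return state

def da_round(state, temp, n_proc=4):
    """Each 'processor' runs an independent MS from the shared state;
    the master keeps the best result, whose entire state array is then
    broadcast as the starting point of the next MS."""
    replicas = [copy.deepcopy(state) for _ in range(n_proc)]
    results = [run_ms(r, temp) for r in replicas]  # concurrent in reality
    return copy.deepcopy(min(results, key=bandwidth))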
The limitation to a few processors remains. In addition, the more processors are used, the shorter the time between communications. The results obtained using this scheme are shown in Table 2.

The two last approaches were used to parallelize the SA part of Dubuc's CM-SA running in cascade. In this case, the cooling starts from a much lower temperature than in the conventional SA. The accepted moves become less frequent and the performance of the parallel schemes improves in terms of speed. The quality of the solution also improves when compared with the serial CM-SA using equivalent running times. For the majority of the cases the minimum solution was obtained in computational time comparable to that of the traditional methods which are used in practice.

11. Concluding remarks

The problem of bandwidth reduction used in the numerical examples exhibits some critical difficulties for the parallelization of the SA. The time which is spent to generate and evaluate a new solution is in general shorter than the time to communicate a solution to the master task, to deliberate about the solution and to communicate the decision back to the network nodes. The size of the messages may be relatively large. In addition, the deterministic generation mechanism of the selected serial model reduces the degree of parallelism and the options for parallel implementations. However, for the great majority of engineering problems, function evaluations are likely to be computationally very expensive when compared with the time for passing messages. In such a case, designers may have a wider range of choices for efficient parallelization of the SA.
The IIPSA may not represent the most efficient utilization of the parallel hardware. Although some speedup may be obtained by using faster cooling schedules, these are in general not substantial enough to justify the loss in the quality of the solution. This class of algorithms may be of some applicability for problems where accuracy is the determining factor in the selection of the parallelization scheme and when it is important to preserve the characteristics of the serial model. In the IIPSA the savings in time are not sufficient to make it efficient for minimizing the bandwidth of sparse matrices.

Parallel implementations with communications after every move, such as PSA1 and PSA2, may only be effective when the function evaluation is computationally expensive. They are suitable for convex problems or problems with very few minima, since they perform an intensive local search but a poor global search.

The DA and CA are the most attractive approaches for engineering problems, in terms of both convergence speed and quality of the solution. Depending on the selected cooling schedule, these algorithms may present a behaviour different from the serial model. In some applications, such a behavioural transformation may enhance the algorithm's capabilities. For engineering problems these schemes have potential for ``superlinear'' speedup, because the time to evaluate a function is in general much longer than the time to broadcast a solution. There is, however, a limitation in the number of processors, which is in general determined by a minimum length of sampling required for an MS. If the MS period becomes very short, the quality of the solution deteriorates. Additional speedup may possibly be obtained by the use of speculative computation. Parallelization appears to be the only general approach to improve the performance of the SA and expand its applicability to engineering optimization.

Unlinked references

Refs. [1, 12, 31, 34, 42, 61, 64].

Acknowledgements

The authors acknowledge, with grateful thanks, the work of Eduardo Dubuc, who developed the serial SA models employed here and kindly lent them for the development of the current study. The financial support from the Brazilian Ministry of Education, CNPq, is also gratefully acknowledged.

References

[1] Aarts EHL, De Bont FMJ, Habers JHA, Korst JHM. Parallel implementations of the stochastical cooling algorithm. Integration 1986;4:209-38.
[2] Aarts EHL, Korst JHM. Simulated annealing and Boltzmann machines: a stochastic approach to combinatorial optimization and neural computing. New York: Wiley, 1989.
[3] Armstrong BA. Near-minimal matrix profiles and wavefronts for testing nodal resequencing algorithms. International Journal for Numerical Methods in Engineering 1985;21:1785-90.
[4] Balling RJ. Optimal steel frame design by simulated annealing. Journal of Structural Engineering 1991;117:1780-95.
[5] Baluja S. Local optimization using simulated annealing. In: IEEE Proceedings of the International Conference on Systems, Man and Cybernetics, 1992, vol. 1, p. 583-588.
[6] Bennage WA, Dhingra AK. Single and multiobjective structural optimization in discrete-continuous variables using simulated annealing. International Journal for Numerical Methods in Engineering 1995;38:2753-73.
[7] Bertoni A, Dorigo M. Implicit parallelism in genetic algorithms. Artificial Intelligence 1993;61:307-14.
[8] Bongiovanni G, Crescenzi P. A parallel simulated annealing for shape detection. Computer Vision and Image Understanding 1995;61:60-9.
[9] Brown DE, Huntley CL, Spilane AR. A parallel genetic heuristic for the quadratic assignment problem. In: Proceedings of the Third International Conference on Genetic Algorithms, 1989. p. 406-15.
[10] Burton FW. Speculative computation, parallelism and functional programming. IEEE Transactions on Computers 1985;34:1190-3.
[11] Casoto A, Romeo F, Sangiovanni-Vincentelli AN. Placement of standard macro-cells using simulated annealing on the connection machine. In: IEEE International Conference on Computer-Aided Design, Santa Clara, 1987. p. 350-3.
[12] Casoto A, Romeo F, Sangiovanni-Vincentelli AN. A parallel simulated annealing for placement of macro-cells. IEEE Transactions on Computer-Aided Design 1987;6:838-47.
[13] Catoni O. Rates of convergence for sequential annealing: a large deviation approach. In: Azencott R, editor. Simulated annealing: parallelization techniques. New York: Wiley, 1992. p. 25-36.
[14] Chen G-S, Bruno RJ, Salama M. Optimal placement of active/passive members in truss structures using simulated annealing. AIAA Journal 1991;29:1327-34.
[15] Cuthill E, McKee JM. Reducing the bandwidth of sparse symmetric matrices. In: Proceedings of the 24th National Conference ACM, New York, 1969. p. 157-72. ACM Publication P69.
[16] Darema F, Kirkpatrick S, Norton VA. Parallel algorithms for chip placement by simulated annealing. IBM Journal of Research and Development 1987;31:391-402.
[17] Dubuc EJ. Bandwidth reduction by simulated annealing. International Journal for Numerical Methods in Engineering 1994;37:3977-98.
[18] Elperin T. Monte Carlo structural optimization in discrete variables with annealing algorithm. International Journal for Numerical Methods in Engineering 1988;26:815-21.
[19] Everstine GC. A comparison of three resequencing algorithms for the reduction of matrix profile and wavefront. International Journal for Numerical Methods in Engineering 1979;14:837-53.
[20] Flynn MJ. Very high speed computer systems. IEEE Proceedings 1966;54.
[21] Gaudron I, Trouve A. Massive parallelization of simulated annealing: an experimental and theoretical approach for spin glass models. In: Azencott R, editor. Simulated annealing: parallelization techniques. New York: Wiley, 1992. p. 163-86.
[22] Glover F. Tabu search, part I. ORSA Journal on Computing 1989;1:190-206.
[23] Glover F. Tabu search, part II. ORSA Journal on Computing 1990;2:4-32.
[24] Greene JW, Supowit KJ. Simulated annealing without rejected moves. IEEE Transactions on Computer-Aided Design 1986;5:221-8.
[25] Greene WH, Haftka RT. Reducing distortion and internal forces in truss structures by member exchanges. NASA Technical Memorandum 101535, 1989.
[26] Grefenstette JJ, Baker JE. How genetic algorithms work: a critical look at implicit parallelism. In: Proceedings of the Third International Conference on Genetic Algorithms, 1989. p. 70-9.
[27] Gurd JR. A taxonomy of parallel computer architectures. In: Proceedings of the International Conference on the Design and Applications of Parallel Digital Processors. Lisbon, Portugal: IEE, 1988.
[28] Hamernik TA. Optimal placement of damped struts using simulated annealing. Journal of Spacecraft and Rockets 1995;32.
[29] Haug EJ, Arora JS. Applied optimal design. New York: Wiley, 1979.
[30] Ingber L. Very fast simulated re-annealing. Mathematical and Computer Modelling 1989;12:967-73.
[31] Kim GH, Yang YS. Stochastic search techniques for global optimization of structures. In: Proceedings of the Korea-Japan Joint Seminar on Structural Optimization. Seoul, Korea, May 1992. p. 87-95.
[32] Kincaid RK. Minimizing distortion in truss structures: a comparison of simulated annealing and tabu search. AIAA Paper No. 91-1095, 1991. p. 424-430.
[33] Kincaid RK. Minimizing distortion and internal forces in truss structures via simulated annealing. Structural Optimization 1992;4:55-61.
[34] Kirkpatrick S, Gelatt Jr CD, Vecchi MP. Optimization by simulated annealing. Science 1983;220:671-80.
[35] Kiselyov BS, Kulakov NYu, Mikaelian AL. Modification of the simulated annealing method for solving combinatorial optimization problems. In: Proceedings SPIE, Photonics for Computers, Neural Networks and Memories, July 1992, vol. 1773, p. 120-124.
[36] Koakutsu S, Sugai Y, Hirata H. Block placement by improved simulated annealing based on genetic algorithm. In: Proceedings of the Fifth Conference on Systems Modeling and Optimization. Zurich: IFIP, 1991, vol. 180, p. 648-656.
[37] van Laarhoven PJM, Aarts EHL. Simulated annealing: theory and applications. Dordrecht: Reidel, 1987.
[38] Lam J, Delosme J-M. Performance of a new annealing schedule. In: Proceedings of the 25th ACM/IEEE Design Automation Conference, 1988. p. 306-11.
[39] Lee FH, Stiles GS. Parallel simulated annealing: several approaches. In: Board Jr JA, editor. NATUG 2, transputer research and applications, vol. 2, 19??.
[40] Lee C-D, Lee WD. Optimal truss design by stochastic simulated annealing. In: Proceedings of the Korea-Japan Joint Seminar on Structural Engineering. Seoul, 1992. p. 191-200.
[41] Leite JPB, Topping BHV. Parallel genetic models for engineering optimization. In: Proceedings of the Structural Engineering Computational Technology Seminar. Edinburgh: Civil-Comp Press, 1995. p. 79-95.
[42] Leite JPB, Topping BHV. Genetic algorithm based optimizers for engineering optimization. In: CIVIL-COMP95, The Sixth International Conference on Civil and Structural Engineering Computing: Developments in Computational Techniques for Structural Engineering. Edinburgh: Civil-Comp Press, 1995. p. 151-65.
[43] Lewis RR. Simulated annealing for profile and fill reduction of sparse matrices. International Journal for Numerical Methods in Engineering 1994;37:905-25.
[44] Lin F-T, Kao C-Y, Hsu C-C. Applying the genetic approach to simulated annealing in solving some NP-hard problems. IEEE Transactions on Systems, Man and Cybernetics 1993;23:1752-67.
[45] Lombardi M. Ottimizzazione di lastre in materiale composito con l'uso di un metodo di annealing simulato [Optimization of composite plates using a simulated annealing method]. Tesi di Laurea, Department of Structural Mechanics, University of Pavia, Italy, 1990.
[46] Lombardi M, Haftka RT, Cinquini C. Optimization of composite plates for buckling by simulated annealing. AIAA Paper No. 92-2313-CP, 1992.
[47] Mahfoud SW, Goldberg DE. Parallel recombinative simulated annealing: a genetic algorithm. IlliGAL Report No. 92002, Department of General Engineering, University of Illinois, Urbana, IL, 1992.
[48] Martin OC, Otto SC. Combining simulated annealing with local search heuristics. In: Laporte G, Osman I, editors. Metaheuristics in combinatorial optimization, 1993 (submitted).
[49] May SA, Balling RJ. A filtered simulated annealing strategy for discrete optimization of 3D steel frameworks. Structural Optimization 1992;4:142-8.
[50] Metropolis N, Rosenbluth A, Teller A, Teller E. Equation of state calculations by fast computing machines. Journal of Chemical Physics 1953;21:1087-92.
[51] NASA. The NASTRAN theoretical manual. Washington, DC: NASA SP-221 (03), 1976.
[52] Nelder JA, Mead R. A simplex method for function minimization. Computer Journal 1965;7.
[53] Onoda J. Actuator placement optimization by genetic and improved simulated annealing algorithms. AIAA Journal 1993;31:1167-9.
[54] Radcliffe N, Wilson G. Natural solutions give their best. New Scientist 1990;1712.
[55] Sakurai T, Lin B, Newton R. Fast simulated diffusion: an optimization algorithm for multiminimum problems and its application to MOSFET model parameter extraction. IEEE Transactions on Computer-Aided Design 1992;11:228-33.
[56] Salama M, Bruno R, Chen G-S, Garba J. Optimal placement of excitations and sensors by simulated annealing. In: Recent advances in multidisciplinary analysis and optimization. NASA CP-3031, 1990. p. 1441-58.
[57] Siarry P, Dupinet E, Mekhilef M. A new way to optimize mechanical systems using simulated annealing. In: Hernandez S, Brebbia CA, editors. Optimization of structural systems and applications. Southampton: Computational Mechanics Publications, 1993. p. 569-83.
[58] Sohn A. Parallel N-ary speculative computation of simulated annealing. IEEE Transactions on Parallel and Distributed Systems 1995;6:997-1005.
[59] Szu H, Hartley R. Fast simulated annealing. Physics Letters A 1987;122:157-62.
[60] Szu H, Hartley R. Nonconvex optimization by fast simulated annealing. Proceedings of the IEEE 1987;75:1538-40.
[61] Thornton AC. Genetic algorithms versus simulated annealing: satisfaction of large sets of algebraic mechanical design constraints. In: Gero JS, Sudweeks F, editors. Artificial intelligence in design '94. Dordrecht: Kluwer Academic, 1994. p. 381-98.
[62] Topping BHV, Khan AI, Leite JPB. Topological design of truss structures using simulated annealing. In: CIVIL-COMP93, The Fifth International Conference on Civil and Structural Engineering Computing and Artificial Intelligence, Neural Networks and Combinatorial Optimization in Civil and Structural Engineering. Edinburgh: Civil-Comp Press, 1993. p. 151-65.
[63] Leite JPB, Topping BHV. Parallel genetic models for engineering optimization. In: Proceedings of the Structural Engineering Computational Technology Seminar. Edinburgh: Civil-Comp Press, May 1995. p. 79-95.
[64] Witte EE, Chamberlain RD, Franklin MA. Parallel simulated annealing using speculative computation. IEEE Transactions on Parallel and Distributed Systems 1991;2:483-94.
[65] Wong C-P, Fiebrich R-D. Simulated annealing-based circuit placement algorithm on the connection machine system. In: Proceedings of the International Conference on Computer Design, 1987. p. 78-82.
