Professional Documents
Culture Documents
64
1063-9535/92 $3.00 0 1992 IEEE
microtasking by itself does not yield high speedup; it 1. but it has 64 panels ( twice as many as example 1 )
should be combined with macrotasking in order to and four supports at quarter points. The upper nodes at
achieve maximum performance. the middle of each span are loaded by a 60-kip
downward load in the vertical y-direction and by 20-kip
Autotasking loads in the x and z directions . The displacements of the
nodes at the middle of each span in the vertical y-
Autotasking is the automatic distribution of tasks direction are restricted to 1/200the of the span.
to multiple processors by compiler. It attempts to detect Example 3 is a 3126-member space trusss (
parallelism in the code automatically. Basically, it Figure 3 ). The U-shape space truss has three wings,
combines vectorization and microtasking automatically. each consisting of 64 equal-span panels in the
longitudinal direction (similar to example 2). It has 30
Multitasking algorithms simple supports as indicated in Figure 3. The loading
and displacement constraint are similar to those of
In the finite element structural analysis, most of example 2.
the time is spent in setting up and assembling the
element stiffness matrices into the structure stiMness speedup results
matrix, the assembling of the structure load vector, and
solving the system of linear equation for displacement We study the speedup due to vectorization and
degrees of freedom. Thus, we will concentrate on the multitasking for the three examples described in the
functions "assemble" that assembles the structure previous section using the algorithms A, B, and C. The
stiffness matrix, "load-vector" that assembles the theoretical speedup due to multitasking is defined as the
structure load vector, and "solve" that solves the ratio of the execution time spent by a task in a sequential
resulting linear equations. Parallel algorithms are program to that spent in a concurrent program [7].
developed and compared using autotasking, Speedup results due to vectorization and various
microtasking, and macrotasking with the objective of types of multitasking are presented for three main
improving the performance of these functions. functions of the algorithms: "assemble", "load-vector",
In this work, three different multitasking and "solve". The last function is divided into two parts:
algorithms are presented and compared. The first reduction of the stiffness matrix into the product of a
algorithm, called algorithm A, is based on the use of lower triangular and an upper triangular matrix, and
autotasking only (61. In the second algorithm, forward and backward substitutions considered together.
microtasking directives are introduced in the loop level. At the end of the paper we present speedup results for
No load balancing is used in this case. In the third complete solution of the optimization problem.
algorithm, both microtasking and macrotasking are used Figures 3 to 5 show the speedup results for
with load balancing. The three algorithms are outlined in setting up the load vector, setting up the structure
Table 1. stiffness matrix, and the forward and backward
substitutions for computing the nodal displacement
Examples vector for example 1, respectively.
Figure 3 shows that autotasking improves the
We solved three space structures using the speedup for the function load-vector substantially
algorithms outlined in Table 1. These examples model because this function consists of only nested loops
space station structures. without any dependencies among matrix elements and
Example 1 is a 526-member space truss (Figure function calls within loops. Microtasking does not
2). It consists of 32 equal-span panels in the improve the performance of this function substantially.
longitudinal direction and one square panel in the This is due to the relatively large overhead needed in
transverse direction. It has two simple supports at each microtasking (compared with the amount of the work
end and two other supports at the middle of the span. done) and poor load balancing which in turn deteriorates
Thus, it is a symmetric continuous two-span truss. The the speedup due to vectorization. In algorithm C where
upper nodes at the middle of each span are loaded in the microtasking is combined with macrotasking, uniform
vertical y-direction by a 60-kip downward load and in x load balancing is achieved. Thus, the effect of
and z directions by 20-kip loads. The displacements of stripmining is maximized in algorithm C.
the nodes at the middle of each span in the vertical y- Figure 4 shows that vectorization and autotasking
direction are restricted to 1/200the of the span. do not improve the speedup in the function "assemble"
Example 2 is a 1046-member space truss. The because of the existence of function calls and the guarded
geometry of this example is the same as that of example regions. Microtasking improves the speedup in the
65
function "assemble" because the amount of work done in efficient vectorized and multitasked algorithm.
setting up the element stiffness matrices and assembling Algorithm C presented in this paper is an example of
them into the structure stiffness matrix is relatively large such algorithm for optimization of large structures where
compared with the overhead required in microtasking. substantial processing power is required even on high
Algorithm C produces higher speedup than both performance machines.
algorithms A and B . 4. The processing time required for optimization of
Figure 5 shows the speedup for the part of the large structures increase exponentially with the size of
function "solve" that performs the forward and backward the structure ( number of design variables ). Example 3
substitutions for computing the nodal displacement of this paper has 3126 members and 2226 displacement
vector. Both vectorization and multitasking are used in degrees of freedom. Development of efficient
this part. Algorithm C produces substantially higher concurrent algorithms utilizing the unique architecture
speedup compared with the algorithms A and B. and capabilities of high-performance computers results in
Figures 6 to 8 show the speedup results for substantial reduction in the overall execution time.
setting up the load vector, setting up the structure
stiffness matrix, and the forward and backward Acknowledgement
substitutions for computing the nodal displacement
vector for example 3. Comparing Figures 6 to 8 with This research has been partially supported by a
the corresponding Figures 3 to 5 for example 1, we grant from Cray Research, Inc. Computing time was
observe that the speedups due to both vectorization and provided by the Ohio Supercomputer center.
multitasking increase with the size of the structure. The
increase due to multitasking is specially substantial. References
These figures also demonstrate the superiority of
algorithm C where we employed a judicious combination 1. Adeli, H. (1992a1, SuwrcomDutine in Engineering
of vectorization, microtasking, and macrotasking with Analysis, Marcel Dekker, New York.
load balancing. 2. Adeli, H. [1992b], Parallel Processing in
The overall speedup results from the algorithm C ComDutational Mechanics, Marcel Dekker, New York.
for the complete optimization of three space structure 3. Adeli, H. and Kamal, 0. [1989], "Parallel Structural
examples without and with vectorinition are presented in Analysis Using Threads", Microcomputers in Civil
Figures 9 and 10, respectively. It is seta that the Engineering, Vol. 4 No. 2,pp. 133-147.
speedup increases substantially with the size of the 4. Adeli, H. and Kamal, O., [1992a], "Concurrent
structure. Optimization of Large Structures. Part I. Algorithms,"
Journal of Aerospace Engineering, ASCE,Vol. 5 , No.
Conclusions 1, pp. 79-90.
5. Adeli, H. and Kamal, 0. [1992b], "Concurrent
Based on this investigation, a number of general Optimization of Large Structures. Part 11.
conclusions may be drawn: Applications, " Journal of Aerospace Engineering,
1. Autotasking works best in programs where most of ASCE. Vol. 5, NO. 1, pp. 91-110.
the code consists of nested loops , Autotasking can not 6. CRAY, [1990], Crav Standard C Proeramer's
be performed when loop iterations contain inter References Manual , SR-20743.0, Cray Research Inc.,
dependent array elementsor when there i s a function 1990.
call. Most of the speedup due to autotasking is 7. CRAY, [1987], Crav Y-MP Multitasking
achieved by vectorization. Basically, it is effective only Pronrammer's Reference Manual , SR-0222 , Cray
when the program has a simple structure. Research Inc., 1987.
2. Microtasking is simple to implement but does not 8. Hsu, H. L. and Adeli, H., " A Microtasking
improve the speedup substantially in complex problems Algorithm for Optimization of Structures",
without combining it with macrotasking. International Journal of Supercomputer Applications,
3. Judicious combination of vectorization. microtasking, Vol. 5, No. 2, pp. 81-90.
and macrotasking is required in order to develop an
66
TABLE 1 Multitasking algorithms for optimization of space structures with multitasking and
vectorization.
3. Set % = 0.1, iteration= 1 and operation= 1, where operation is a factor to indicate whether
this step is in the analysis stage (operation = 1) or in the redesign stage (operation =2).
4. Assemble the structure stiffness matrix.
Do concurrently:
i - Calculate element stiffness matrices.
A ( autotasking ) B ( microtasking ) C ( macrotaskine with load balancing ).
ii- Assemble element stiffness matrices into the structure stiffness matrix .
A ( autotasking ) B ( microtasking with guarded regions ) C ( microtaskine with guarded
regions.
5. Assemble total load vector.
Do concurrently:
Assemble the nodal forces into the total load vector.
A ( autotaskine and vectorization ) B ( microtasking and vectorization )
'
67
C ( mkrotasking with euarded regions and vectorization ).
11. Calculate the maximum stress ratio (stress-ratio).
A ( autotasking and vectorintion ) B ( microtasking with warded regions and vectorintion )
C ( microtasking with euarded regions and vectorization ).
If there is no constrained displacement, go to step 16.
Otherwise go to step 12.
12. If iteration= 1, go to step 14. Otherwise, if the value of the objective function (W) is less than that
13.
%
of the previous iteration divide the value of by and
two goto step 14.
Find the active displacements, those Within 0.1% of the allowable values.
A ( autotasking ) B ( microtasking ) C ( microtaskine with load balancing ).
14. Apply unit loads in the directions of the most violated degrees of freedom one at a time; each time
go to step6.
15. Calculate the displacement gradients.
Do concurrently:
i - Calculate the element stiffness matrices.
A ( autotasking ) B ( microtasking ) C ( macrotasking with load balancing ).
ii- Calculate the displacement gradients.
A ( autotasking ) B ( microtasking ) C ( microtasking with load balancing ).
16. Calculate the new design variables for the next iteration as follow.
If stress-ratio > 1, modify the design variables using the stress ratio recurrence formula ,
otherwise, modify the design variables using the optimality criteria r e c u m c e formula.
A ( autotasking ) B ( mkrotaskine ) C ( microtaskine with load balancing ).
17. Calculate the new objective function and set iteration= iteration+ 1.
18. Go to step 4.
68
0 simple support
T
16 x 60"
16 x 60"
z Y
t4- 1 2 0 - H t4- 1 2 0 4
Yt
N- 120%
Front elevation
69
0 simple support
X
4
M-tu* w- lac*
Plan
Side Elevation
T
120-
I.
Front Elevation
70
'T -
Algorithm A
-
Algorithm B
5 -
Algorithm C
3 --
4 --
n.
x 2 --
W
n.
v1
3 --
a.
-0
0)
a
CO
1 --
2 --
0 1 2 3 4
#-
0 1 2 3 4 O 1 2 3 4
#-
P # procarors
Figure 5 Speedup comparisons for the forward/ Figure 6 Speedup comparisons for the
backward substitutions in the function "function "load vector" in example 3
"solve" in example 1
71
Theoreticalw%houl
Vedocizat"
4
i 4
Algorithm A
-A!gofilhmB
3 AlgocirhC 3
P a
;2 d
0 2
:
: t
1 1
0
0
/ 1 2 3
I
4
0
0 1 2 3
I
4
lprocmorr a'lwoc-
Figure 7 Speedup comparisons for the Figure 8 Speedup comparisons for the
function "assemble" in example 3. forward and backward substitutions
in the function "solve" in example 3
3
3
a
W
PI
t 2 1
a
va t 2
3
/
1
I
0
0 1 2 3 4
0 1 2 3 4
x proce5on
4l proc-
72