
TECHNICAL LITERATURE

Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities
Reprinted from the AFIPS Conference Proceedings, Vol. 30 (Atlantic City, N.J., Apr. 18–20),
AFIPS Press, Reston, Va., 1967, pp. 483–485, when Dr. Amdahl was at International
Business Machines Corporation, Sunnyvale, California.
Dr. Gene M. Amdahl

This article was the first publication by Gene Amdahl on what became known as Amdahl's Law. Interestingly, it has no equations and only a single figure. For this issue of the SSCS News, Dr. Amdahl agreed to redraw the figure. In the available hard copy it was illegible. We print this historic paper to enable members to read the original source from some 40 years ago.
The Editors

For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. Variously the proper direction has been pointed out as general purpose computers with a generalized interconnection of memories, or as specialized computers with geometrically related memory interconnections and controlled by one or more instruction streams.

Demonstration is made of the continued validity of the single processor approach and of the weaknesses of the multiple processor approach in terms of application to real problems and their attendant irregularities.

The arguments presented are based on statistical characteristics of computation on computers over the last decade and upon the operational requirements within problems of physical interest. An additional reference will be one of the most thorough analyses of relative computer capabilities currently published: "Changes in Computer Performance," Datamation, September 1966, Professor Kenneth E. Knight, Stanford School of Business Administration.

The first characteristic of interest is the fraction of the computational load which is associated with data management housekeeping. This fraction has been very nearly constant for about ten years, and accounts for 40% of the executed instructions in production runs. In an entirely dedicated special purpose environment this might be reduced by a factor of two, but it is highly improbable that it could be reduced by a factor of three. The nature of this overhead appears to be sequential so that it is unlikely to be amenable to parallel processing techniques. Overhead alone would then place an upper limit on throughput of five to seven times the sequential processing rate, even if the housekeeping were done in a separate processor. The non-housekeeping part of the problem could exploit at most a processor of performance three to four times the performance of the housekeeping processor. A fairly obvious conclusion which can be drawn at this point is that the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude.

Data management housekeeping is not the only problem to plague oversimplified approaches to high speed computation. The physical problems which are of practical interest tend to have rather significant complications. Examples of these complications are as follows: Boundaries are likely to be irregular; interiors are likely to be inhomogeneous; computations required may be dependent on the states of the variables at each point; propagation rates of different physical effects may be quite different; the rate of convergence, or convergence at all, may be strongly dependent on sweeping through the array along different axes on succeeding passes, etc. The effect of each of these complications is very severe on any computer organization based on geometrically related processors in a paralleled processing system. Even the existence of regular rectangular boundaries has the interesting property that for spatial dimension of N there are 3^N different point geometries to be dealt with in a nearest neighbor computation. If the second nearest neighbor were also involved, there would be 5^N different point geometries to contend with. An irregular boundary compounds this problem as does an inhomogeneous interior. Computations which are dependent on the states of variables would require the processing at each point to consume approximately the same computational time as the sum of computations of all physical effects within a large region. Differences or changes in propagation rates may affect the mesh point relationships.

Ideally the computation of the action of the neighboring points upon the point under consideration involves their values at a previous time proportional to the mesh spacing and inversely proportional to the propagation rate. Since the time step is normally kept constant, a faster propagation rate for some effects would imply interactions with more distant points. Finally, the fairly common practice of sweeping through the mesh along different axes on succeeding passes poses problems of data management which affects all processors; however, it affects geometrically related processors more severely by requiring transposing all points in storage in addition to the revised input-output scheduling. A realistic assessment of the effect of these irregularities on the actual performance of a parallel processing device, compared to its performance on a simplified and regularized abstraction of the problem, yields a degradation in the vicinity of one-half to one order of magnitude.
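[Editorial illustration, not part of the original paper.] The housekeeping argument above is, in modern terms, the germ of what is now called Amdahl's Law. The short Python sketch below reproduces its arithmetic under the stated assumptions: about 40% of executed instructions are essentially sequential housekeeping, reducible at best by a factor of two or three, which caps throughput at roughly five to seven times the sequential rate no matter how much parallel hardware is added.

    def amdahl_speedup(serial_fraction, n_processors):
        """Upper bound on speedup when a fixed fraction of the work is serial."""
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

    # 40% of executed instructions are sequential housekeeping; assume it can
    # be cut by a factor of two (optimistic) or three (called highly improbable
    # in the text above).
    for reduction in (2.0, 3.0):
        serial = 0.40 / reduction
        # Asymptotic ceiling as the number of parallel units grows without bound.
        ceiling = 1.0 / serial
        print(f"serial fraction {serial:.2f}: ceiling {ceiling:.1f}x "
              f"(1000 parallel units give {amdahl_speedup(serial, 1000):.1f}x)")
    # Prints roughly 5x and 7.5x, matching the "five to seven times" estimate.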
To sum up the effects of data management housekeeping and of problem irregularities, the author has compared three different machine organizations involving approximately equal amounts of hardware. Machine A has thirty-two arithmetic execution units controlled by a single instruction stream. Machine B has pipelined arithmetic execution units with up to three overlapped operations on vectors of eight elements. Machine C has the same pipelined execution units, but initiation of individual operations at the same rate as Machine B permitted vector element operations. The performance of these three machines is plotted in Figure 1 as a function of the fraction of the number of instructions which permit parallelism. The probable region of operation is centered around a point corresponding to 25% data management overhead and 10% of the problem operations forced to be sequential.

The historic performance versus cost of computers has been explored very thoroughly by Professor Knight. The carefully analyzed data he presents reflects not just execution times for arithmetic operations and cost of minimum recommended configurations. He includes memory capacity effects, input-output overlap experienced, and special functional capabilities. The best statistical fit obtained corresponds to a performance proportional to the square of the cost at any technological level. This result very effectively supports the often invoked "Grosch's Law." Utilizing this analysis, one can argue that if twice the amount of hardware were exploited in a single system, one could expect to obtain four times the performance. The only difficulty is involved in knowing how to exploit this additional hardware. At any point in time it is difficult to foresee how the previous bottlenecks in a sequential computer will be effectively overcome. If it were easy they would not have been left as bottlenecks. It is true by historical example that the successive obstacles have been hurdled, so it is appropriate to quote the Rev. Adam Clayton Powell: "Keep the faith, baby!" If alternatively one decided to improve the performance by putting two processors side by side with shared memory, one would find approximately 2.2 times as much hardware. The additional two tenths in hardware accomplish the crossbar switching for the sharing. The resulting performance achieved would be about 1.8. The latter figure is derived from the assumption of each processor utilizing half of the memories about half of the time. The resulting memory conflicts in the shared system would extend the execution of one of two operations by one quarter of the execution time. The net result is a price performance degradation to 0.8 rather than an improvement to 2.0 for the single larger processor.

Comparative analysis with associative processors is far less easy and obvious. Under certain conditions of regular formats there is a fairly direct approach. Consider an associative processor designed for pattern recognition, in which decisions within individual elements are forwarded to some set of other elements. In the associative processor design the receiving elements would have a set of source addresses which recognize by associative techniques whether or not it was to receive the decision of the currently declaring element. To make a corresponding special purpose non-associative processor one would consider a receiving element and its source addresses as an instruction, with binary decisions maintained in registers. Considering the use of thin film memory, an associative cycle would be longer than a non-destructive read cycle. In such a technology the special purpose non-associative processor can be expected to take about one-fourth as many memory cycles as the associative version and only about one-sixth of the time. These figures were computed on the full recognition task, with somewhat differing ratios in each phase. No blanket claim is intended here, but rather that each requirement should be investigated from both approaches.
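[Editorial illustration, not part of the original paper.] The dual-processor comparison above can be checked with a small sketch, assuming Knight's performance-proportional-to-cost-squared fit and one reading of the stated memory-conflict model (half of all operations stretched by a quarter of their execution time):

    # Grosch's Law baseline: performance grows as the square of cost at a
    # fixed technology level (Knight's statistical fit).
    def grosch_performance(relative_cost):
        return relative_cost ** 2

    # Single larger processor built from twice the hardware.
    single_cost = 2.0
    single_perf = grosch_performance(single_cost)   # 4.0
    single_price_perf = single_perf / single_cost   # 2.0

    # Two processors sharing memory: 2.2x the hardware, the extra 0.2 paying
    # for the crossbar switch.  Memory conflicts stretch one of every two
    # operations by a quarter of its execution time.
    dual_cost = 2.2
    average_stretch = 0.5 * 0.25                    # 12.5% longer on average
    dual_perf = 2.0 / (1.0 + average_stretch)       # about 1.8
    dual_price_perf = dual_perf / dual_cost         # about 0.8

    print(f"single: perf {single_perf:.1f}, price-performance {single_price_perf:.1f}")
    print(f"dual:   perf {dual_perf:.1f}, price-performance {dual_price_perf:.1f}")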

The diagram above illustrating “Amdahl’s Law” shows that a highly parallel machine has a harder time delivering a fair fraction of its peak performance due to the sequential component of the given computation and the overhead of coordination (e.g. synchronization) between the processors. Assuming a fixed-size problem, Amdahl speculated that most programs would require at least 25% of the computation to be sequential (only one instruction executing at a time), with overhead due to interprocessor coordination averaging 10%. The curves show that the more you depend on parallelism for performance, the slower the system is likely to be in the probable case of about 65% parallel content. The lowest curve (A) represents the 32-wide SIMD processor, and the top curve (C) is for the modified vector processor. Scaled problems reduce the sequential component and the coordination overhead to a negligible level, making large numbers of processors very efficient in those cases. Justin Rattner, Intel Senior Fellow, justin.rattner@intel.com, July 2007.
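[Editorial illustration.] A simple Amdahl-style model (not the exact curves Dr. Amdahl plotted) shows how the probable operating point described above, with roughly 35% of the work sequential or overhead and 65% parallelizable, caps the benefit of very wide machines:

    def speedup(parallel_fraction, width):
        """Simple Amdahl-style model: the parallel fraction runs `width` times
        faster; everything else runs at the sequential rate."""
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / width)

    # Probable region of operation: about 25% + 10% of the work cannot be
    # parallelized, leaving roughly 65% that can benefit from parallel or
    # vector execution.
    parallel = 0.65
    for width in (2, 4, 8, 32):
        print(f"{width:2d}-wide machine: {speedup(parallel, width):.2f}x")
    # Even the 32-wide organization gains only about 2.7x at this point.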

