Michael J. Flynn
Stanford University
Abstract
For specified program behavior and clocking overhead, there is an optimum cycle time.
This can be improved somewhat by using wave pipelining, but program unpredictability ul-
timately limits performance by restricting both cycle time and instruction level parallelism.
Algorithm and application implementation should be based on an understanding of program behavior, CAD tools, and technology.
A system on a chip can be realized as die potential increases. Such a system die consists of a collection of functional implementations on a single chip: a core processor, floating point unit, signal processors, cache, message compression and encryption, etc.
Functional implementations involve selecting particular algorithms so that total appli-
cation execution time is minimized under the constraints of fixed die area.
Underlying all improvements in processor architecture are fundamental notions of the op-
timum use of time and space. In silicon CMOS technologies, the notion of optimum cost–
performance is translated into performance–area optimality, as chip area largely determines
chip cost (with the other major determinant being manufacturing volume).
In the process of optimizing time and space, we deal with a complex set of tradeoffs in-
volving a number of technology sub-disciplines. Time optimization requires control over VLSI
fab processing, state-of-the-art CAD tools, and an understanding of physical constraints such as
power and heat dissipation.
An understanding of algorithms allows the user to select the best design choices within
area–time restrictions. Selecting good algorithms also requires a broad understanding of avail-
able technology, behavior of user applications, and available software [4]. Finally, there is die
area itself. This is a primary determinant of marginal production cost. In an era of increasing
availability of chip area through improved technology (reduced feature size), the issue becomes
one of selecting functions that make optimum use of marginally available area. Table 1 shows
the projected performance area improvement for silicon die over the next ten years. Silicon die
that in 1997 have 10 million transistors will be replaced in ten years by die that have more than
25 times that number of active devices. Cycle time is also on a similar curve, so that gigahertz or near-gigahertz clock rates are expected in the same time frame.
This work has been supported in part by the National Science Foundation under grant MIP-9313701.
Table 1: 1994 SIA Roadmap Summary [9].

1st DRAM Year        1992   1995   1998   2001   2004   2007
Feature Size (µm)    0.50   0.35   0.25   0.18   0.13   0.10
VDD (V)              5.0    3.3    2.5    1.8    1.5    1.2
Trans/Chip           5M     10M    20M    50M    110M   260M
Die Size (mm²)       210    250    300    360    430    520
Freq (MHz)           225    300    450    600    800    1000
DRAM Bits/Chip       16M    64M    256M   1G     4G     16G
SRAM Bits/Chip       4M     16M    64M    256M   1G     4G
1 Time
It is generally, but not always, true that processors with faster cycle times provide better overall
performance. Consider the case of a simple pipelined processor.
For these simple pipelined processors, an estimate of the optimum number of stages in a pipeline, S_opt, has been determined [2] as

    S_opt = sqrt( (1 − b) T / (c b) ).    (1)

The b term represents the fraction of instructions that cause a pipeline "break" of duration T. The ratio c/T is simply the clocking overhead (c) per cycle as a fraction of the total instruction execution latency (T).
If we make the cycle time too large and have too few stages in the pipeline, we sacrifice overall performance. On the other hand, if we make the cycle time too small, we incur increasing amounts of clock overhead and suffer longer pipeline breaks.
Since the cycle time, t, is

    t = T/S + c,

the optimum t corresponds to S_opt. Adding stages beyond S_opt lowers performance, largely due to the accumulation of clock overhead, c·S.
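Equation (1) and the resulting cycle time can be evaluated directly; the break fraction, overhead, and latency in this sketch are illustrative numbers, not measurements from the text.

```python
import math

def optimal_pipeline_depth(b: float, c: float, T: float) -> float:
    """S_opt = sqrt((1 - b) * T / (c * b)) for break fraction b,
    clock overhead c, and total instruction latency T."""
    return math.sqrt((1.0 - b) * T / (c * b))

def cycle_time(T: float, S: float, c: float) -> float:
    """Cycle time of an S-stage pipeline: t = T / S + c."""
    return T / S + c

# Illustrative numbers (not from the text): 20% break fraction,
# 0.1 ns clock overhead, 10 ns total instruction latency.
b, c, T = 0.2, 0.1, 10.0
S_opt = optimal_pipeline_depth(b, c, T)        # 20 stages
print(f"S_opt = {S_opt:.1f}, t_opt = {cycle_time(T, S_opt, c):.2f} ns")
```

Note how sharply the optimum depends on b: halving the break fraction raises S_opt, and the optimum cycle time falls accordingly.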
It is possible to use latchless pipelines, or wave pipelining, which uses the inherent delay of combinational logic as a storage element. This allows faster cycle times, since it does not incur additional clock overhead [4, 11].
Suppose a stage of logic has maximum delay P_max and minimum delay P_min, with clocking overhead c. Conventionally, the cycle time for a pipelined system would be defined as

    t = P_max + c.    (2)

It is easy to show that, if we can guarantee a delay of at least P_min on every path from the source register to the destination register through this stage of logic, we can clock the logic at a much higher rate. This rate is bounded by

    t = (P_max − P_min) + c.
The advantage of improving clock rates by wave pipelining is that it does not add clock overhead to the overall execution time of instructions. In practice, it seems possible to achieve a cycle time of about one third of the conventionally expected cycle time by using wave pipelining.
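The two cycle-time bounds can be compared with a small sketch; the stage delays and overhead below are hypothetical, chosen so the wave-pipelined cycle comes out near one third of the conventional one.

```python
def conventional_cycle(p_max: float, c: float) -> float:
    """Conventional pipelined cycle time: t = P_max + c (Eq. 2)."""
    return p_max + c

def wave_pipelined_cycle(p_max: float, p_min: float, c: float) -> float:
    """Wave-pipelined bound: t = (P_max - P_min) + c, exploiting the
    guaranteed minimum path delay P_min through the stage."""
    return (p_max - p_min) + c

# Hypothetical stage delays: 3.0 ns max, 2.2 ns min, 0.3 ns overhead.
t_conv = conventional_cycle(3.0, 0.3)
t_wave = wave_pipelined_cycle(3.0, 2.2, 0.3)
print(f"conventional: {t_conv:.1f} ns, wave-pipelined: {t_wave:.1f} ns")
```

The gain depends entirely on how tightly P_min can be guaranteed to track P_max; path-balancing is what makes wave pipelining hard in practice.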
In the end, performance enhancements by using long pipelines are limited by unpredictable
behavior of instructions (b) during program execution.
One can imagine an equation analogous to (1) describing the selection of an optimum degree of instruction level parallelism, ILP_opt:

    ILP_opt = sqrt( (1 − d) / (d · O) ).    (3)
2 Algorithms [10]
Algorithms tie together technology, CAD, and software into an implementation which suits
a particular application. In this section, we illustrate several important aspects of algorithm
implementation in general, using advances in familiar floating point arithmetic operations as an
important exemplar.
[Figure 1: block diagram with exponent difference, significand add/subtract, PENC, half add, left shift, compound add (ComAdd), and MUX stages.]

Figure 1: Three-cycle pipelined adder with combined rounding. LOP is leading "1" prediction; PENC is priority encoding.
shifter). Of course, it was always possible to build slower and less expensive implementations simply by reusing adders or shifters, etc. But for the moment, we consider only the fastest state-of-the-art implementations. In the early 1980's, Farmwald [3] introduced what is called a "two-path" floating point adder. He does this (Figure 1) by recognizing that the two long shifts cannot both occur for the same set of operands. A long left shift (the second shift) can only occur when two numbers of similar magnitude are subtracted. If the exponents are far apart (the FAR path), the result of subtracting two floating point numbers cannot require a long left shift. If the exponents are close, however, the numbers can cancel each other, leaving leading zeros in the result and requiring a long left shift to renormalize the sum. Thus, it is possible to build an adder that separates the two cases. The first case computes the exponent difference and performs the expected long right (pre-)shift, putting the smaller number in proper position with respect to the larger, anticipating an add or a subtract that cannot cause cancellation requiring a left shift. The second case recognizes that the exponents are close (the CLOSE path), so the operands cannot require long preshifting for alignment before the addition or subtraction; in this path, cancellation under subtraction can occur, requiring the long left shift.
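The path-selection idea can be sketched in a few lines; the exponent-difference threshold and the single "effective subtraction" flag are simplifications for illustration, not Farmwald's actual hardware design.

```python
def select_path(exp_a: int, exp_b: int, effective_subtract: bool) -> str:
    """Two-path FP adder path selection (illustrative): only an
    effective subtraction of operands with nearly equal exponents can
    cancel leading digits and require a long left shift (CLOSE path);
    everything else takes the FAR path (long right alignment only)."""
    close = abs(exp_a - exp_b) <= 1
    return "CLOSE" if (close and effective_subtract) else "FAR"

print(select_path(100, 100, True))   # CLOSE: cancellation possible
print(select_path(100, 87, True))    # FAR: alignment shift only
print(select_path(100, 100, False))  # FAR: addition cannot cancel
```

Because exactly one path is taken per operation, each path's datapath can omit one of the two long shifters, which is the source of the latency saving.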
In the late 1980's, the art was further advanced by the use of compound adders to replace the final addition required for rounding [8]. A compound adder (ComAdd) is used which, given x and y as inputs, produces both x + y as the sum and x + y + 1 as an augmented sum. For most cases of rounding, one of these is sufficient to give the final correct result of a floating point addition; control circuitry selects which of the precomputed outcomes is the correct one.
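The compound-adder idea reduces to computing both candidates once and then selecting; the sketch below collapses the rounding decision to a single hypothetical flag rather than the full IEEE case analysis of [8].

```python
def compound_add(x: int, y: int):
    """One adder pass producing both candidates the rounder may need:
    the sum x + y and the augmented sum x + y + 1."""
    s = x + y
    return s, s + 1

def rounded_sum(x: int, y: int, round_up: bool) -> int:
    """Control logic (reduced here to a single flag) selects one of the
    precomputed outcomes instead of doing a second, serial addition."""
    s, s_plus_1 = compound_add(x, y)
    return s_plus_1 if round_up else s

print(rounded_sum(12, 30, False))  # 42
print(rounded_sum(12, 30, True))   # 43
```

The point is latency: the "+1" never sits on the critical path, because selection is a multiplexer rather than a carry-propagate addition.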
It is possible to take this one step further, using the variable latency algorithms recently developed by Oberman [7]. This approach recognizes that there are many common cases where the result can be available early. For example, when the two operands have the same exponent (the CLOSE case) and the operation is addition, no long alignment shift is required. In another case, if the exponents are far apart, the smaller number simply slips out of range of the larger number and, depending on rounding, perhaps no addition is necessary. In each of these cases the possible simplification can be detected early and functional speedup is possible. For example, instead of a three-cycle floating addition, it might be possible to do the addition in one or two cycles for at least some operands, and three cycles in the worst case. Oberman shows that a speedup of 1.33 is possible.
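The quoted speedup is just a weighted-average latency computation. The distribution below is hypothetical (chosen to reproduce a 1.33 speedup), not Oberman's measured data.

```python
def variable_latency_speedup(fractions, fixed_cycles=3):
    """Speedup of a variable-latency unit over a fixed-latency one:
    fixed_cycles divided by the weighted-average cycle count."""
    avg = sum(cycles * frac for cycles, frac in fractions.items())
    return fixed_cycles / avg

# Hypothetical distribution: 25% of adds finish in 1 cycle,
# 25% in 2 cycles, 50% in the worst-case 3 cycles.
print(f"speedup = {variable_latency_speedup({1: 0.25, 2: 0.25, 3: 0.50}):.2f}")
```

This is exactly the first lesson below: the achievable speedup is a property of the input-data distribution, not of the worst-case datapath.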
Lessons from FP Addition
1. By fully understanding the distribution of input data, we can achieve a significant speedup
in the overall execution of an algorithm.
2. To actually use this speedup, we must have a robust host machine that is able to use results
whose execution time is unpredictable.
Reduce (add) partial products to form for each bit (column) a sum and carry
[Figure 2: plot of delay ratio (algorithmic layout vs. binary tree) against feature size (µm), with curves labeled "Binary Tree" and "Automatic Layout".]

Figure 2: Relative delay: algorithmic layout to binary tree for non-Booth double precision and Booth-2 quad precision.
1. The designer can use a Booth encoding of the multiplier; this reduces the number of partial products from n to n/2 by using shifted and signed versions of the multiplicand as partial products. By reducing the number of partial products, we shrink the size of the partial product reduction tree, even though we increase the circuitry necessary to generate the partial products. At large feature sizes, the generation delay dominates the overall multiply time and Booth encoding is not recommended. At small feature sizes, the wires in the partial product reduction tree dominate the delay, and hence Booth encoding (called Booth 2) is preferable.
2. In actually building the partial product reduction tree, Al-Twaijry [1] shows that an algorithmically laid out tree of counters in a so-called Wallace tree, placed using CAD tools, is better than the more regular, custom-designed "4:2" compressor layout, which constructs a balanced binary tree of counters. The algorithmic layout of each column must be performed by a computer, which places counters and wires to minimize the overall effect of wire delay.
Figure 2 shows the relative delay of the two approaches as a function of feature size. Notice
that as the feature size decreases, the algorithmic approach shows increased advantage.
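The Booth-2 recoding described in point 1 above can be sketched as follows; the recoding table is the standard radix-4 Booth table, but the code is an illustration of the n to n/2 reduction, not a hardware description.

```python
def booth2_digits(multiplier: int, n_bits: int):
    """Radix-4 (Booth-2) recoding: an n-bit two's-complement multiplier
    becomes about n/2 signed digits in {-2, -1, 0, 1, 2}, one partial
    product per digit."""
    bits = multiplier << 1          # implicit 0 below the LSB
    table = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    # Scan overlapping 3-bit windows, stepping 2 bits at a time.
    return [table[(bits >> i) & 0b111] for i in range(0, n_bits, 2)]

def booth2_multiply(x: int, y: int, n_bits: int) -> int:
    """Sum the shifted, signed partial products 4**i * d_i * x."""
    return sum(d * x * 4 ** i for i, d in enumerate(booth2_digits(y, n_bits)))

# 53 as a signed 7-bit multiplier needs only 4 partial products, not 7.
print(booth2_digits(53, 7))          # [1, 1, -1, 1]
print(booth2_multiply(7, 53, 7))     # 371 == 7 * 53
```

Halving the digit count halves the depth pressure on the reduction tree, at the cost of the negation/doubling circuitry needed to form each signed partial product.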
Lessons from Floating Point Multiply
[Figure 3: bar chart of interlock distance (instructions) for spice2g6, doduc, mdljdp2, tomcatv, ora, alvinn, ear, su2cor, hydro2d, nasa7, and fpppp; non-optimized average 3.34, optimized average 10.22.]

Figure 3: Interlock distances.
2. CAD tools play a vital role in allowing implementation optimization. These CAD tools are generally at a higher level than the simple schematic-entry routing and placement that we have known and used extensively in the past.
1. Robust compilers and systems software can provide significant help in performance opti-
mization.
2. As processors exploit increasing levels of ILP in their search for speed, even optimized software can provide only a limited amount of relief before faster implementations are simply required.
2.4 Comment
In our lessons, above, the three floating point operations span a spectrum of typical algorithms.
The actual details in optimizing the floating point operations are less important than the overall
lessons. The computer design process involves optimizing and selecting algorithms based on
full knowledge and characterization of technology, CAD tools, input data, systems software,
and host processor architecture.
3 Area
The last basic tradeoff exists in determining an optimum die size [4]. Die yield is a function of both the defect density (D) and the die area, A. In a simple Poisson model, the yield is

    Yield = e^(−D·A).
A typical defect density is about one defect per square centimeter. Die cost is determined
by dividing the wafer cost by the number of good chips realized on the wafer. As die area
increases, both the total number of chips on the wafer decrease, and the number of good chips
decreases as the yield decreases. Since there is a fixed packaging testing cost associated with
each die, making a die very small does not significantly decrease overall packaged die costs;
indeed, having too little functionality on the packaged die decreases the attractiveness of the
part to the buyer. This decreases the production run of the part and increases costs. Currently,
most processor dies have area between 1.5 and 4.0 cm². Of course, what one actually achieves
on a die depends on the lithography, which in turn determines the number of devices that can
be placed on a given die area. The finer the lithography, the smaller the devices, and the more
devices that can be placed on the die. Lithography is usually expressed in terms of the minimum feature size, f (the size of the smallest realizable transistor); the minimum transistor occupies an area on the order of f².
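The Poisson yield model and its effect on per-die cost can be illustrated directly; the wafer cost and wafer area below are hypothetical round numbers, not figures from the text.

```python
import math

def poisson_yield(defect_density: float, area_cm2: float) -> float:
    """Simple Poisson yield model: Yield = exp(-D * A), for defect
    density D in defects/cm^2 and die area A in cm^2."""
    return math.exp(-defect_density * area_cm2)

def cost_per_good_die(wafer_cost: float, wafer_area_cm2: float,
                      die_area_cm2: float, defect_density: float = 1.0) -> float:
    """Wafer cost divided by the number of good dies; edge losses and
    packaging/test costs are ignored for simplicity."""
    dies = wafer_area_cm2 / die_area_cm2
    return wafer_cost / (dies * poisson_yield(defect_density, die_area_cm2))

# Hypothetical $3000 wafer of ~314 cm^2, D = 1 defect/cm^2: cost per
# good die grows much faster than linearly with die area.
for area in (1.5, 2.5, 4.0):
    print(f"A = {area} cm^2: yield = {poisson_yield(1.0, area):.2f}, "
          f"cost = ${cost_per_good_die(3000.0, 314.0, area):.0f}")
```

The exponential in the yield term is what makes very large dies expensive: doubling area more than doubles cost per good die.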
The area optimization problem is simply making the best use of available silicon. Each algorithmic state of the art can be encapsulated in an AT curve, as shown in Figure 4. In general, we have

    A·T^n = k,

where n usually ranges from 1 to 2, depending on the nature of communication within the implementation.
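A design point can be checked against such a curve directly; the values of n, k, and the tolerance in this sketch are placeholders, not calibrated figures.

```python
def at_par(area: float, time: float, n: float, k: float, tol: float = 1e-6) -> str:
    """Classify a design point against the state-of-the-art curve
    A * T**n = k: interior points (A * T**n > k) are 'over par',
    points below the curve are 'under par'."""
    v = area * time ** n
    if abs(v - k) <= tol * k:
        return "par"
    return "over par" if v > k else "under par"

# Hypothetical curve with n = 1.5, k = 100 (arbitrary units).
print(at_par(100.0, 1.0, 1.5, 100.0))  # par
print(at_par(200.0, 1.0, 1.5, 100.0))  # over par
print(at_par(25.0, 1.0, 1.5, 100.0))   # under par
```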
Table 2: FUPA components and results of recently announced processors
Note that k defines the "state of the art" (the "par" designs) at any particular moment. Designs at points in the interior of the curve are inferior, or "over par". Over time, k (and "par") decreases due to advances in technology and in the algorithms themselves.
To study algorithms, the effect of technology can be largely eliminated by using technology-normalized area and time. Along this line, we developed a cost-performance metric for evaluating floating-point unit implementations [5]. The metric, Floating Point Unit Cost Performance
Analysis Metric (FUPA), incorporates five key aspects of VLSI systems design: latency, die
area, power, minimum feature size and profile of applications. FUPA utilizes technology pro-
jections based on scalable device models to identify the design/technology compatibility, and
allows designers to make high level tradeoffs in optimizing FPU designs.
We summarize the five steps of computing FUPA as:
1. Profile the applications to obtain dynamic floating point operation (add/sub, multiply, and
divide) distribution in the application,
2. Compute Effective Latency (EL) from the clock rate, FPU latencies, and the dynamic FP
operation distribution obtained in step 1,
3. Measure the die area (Area) of the FPU not including the register file,
4. Compute Normalized Effective Latency (NEL) and Normalized Area (NArea), removing
the feature size dependency, and
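Steps 2 and 4 can be sketched as a simple calculation; the operation mix, latencies, clock rate, and the particular normalization used below are illustrative assumptions, not the calibrated FUPA metric of [5].

```python
def effective_latency(mix, cycles, clock_mhz):
    """Step 2: effective latency EL (ns) = dynamic-mix-weighted FPU
    latency in cycles, converted at the given clock rate."""
    avg_cycles = sum(mix[op] * cycles[op] for op in mix)
    return avg_cycles * 1000.0 / clock_mhz   # 1000 / MHz = ns per cycle

def normalize(el_ns, area_mm2, feature_um):
    """Step 4: remove first-order feature-size dependency:
    NEL = EL / f and NArea = Area / f**2 (one plausible scaling)."""
    return el_ns / feature_um, area_mm2 / feature_um ** 2

# Hypothetical mix and latencies: 55% add (3 cy), 40% multiply (3 cy),
# 5% divide (20 cy), on a 300 MHz, 15 mm^2 FPU at f = 0.35 um.
mix = {"add": 0.55, "mul": 0.40, "div": 0.05}
cycles = {"add": 3, "mul": 3, "div": 20}
el = effective_latency(mix, cycles, 300.0)
nel, narea = normalize(el, 15.0, 0.35)
print(f"EL = {el:.2f} ns, NEL = {nel:.2f}, NArea = {narea:.1f}")
```

Note how the rare divide operations still move EL noticeably: weighting by the dynamic distribution is the point of step 1.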
[Figure 4: AT curves of validated, scalable design points; points in the interior of a curve are over par, points below it under par.]
for a particular application (by allowing more or less area to a function) under a fixed total area
constraint. This gives an optimum “system on a chip” or system die implementation.
The problem is to select among multiple design points across multiple applications to make
the best use of die area (Figure 4). To make system design on a chip work, one must have
several supporting tools.
2. These designs must be scalable for various processes and for various process parameters,
such as feature size, wire pitch, number of wiring levels, etc.
3. Each of these design points must be validated across all data points.
4. The combination for each of the designs must have a ready interface to accommodate the
various interconnection busses.
4 Conclusions
Processor architecture is increasingly dependent on the understanding of technologies outside
that of traditional algorithm/logic implementation. The understanding of physical technology
and the availability of support tools in CAD and compilers are essential for advanced processor and functional unit implementations. Indeed, as the technology improves and a complete
“system on a chip” becomes available, it is even more important to have access to advanced,
optimized and constantly updated functional unit designs. Thus, the task of the computer architect is not only to know the basics of processor design, but also to be completely familiar with all the supporting technology at both the processor and the system level.
References
[1] H. Al-Twaijry. Area and Performance Optimized CMOS Multipliers. PhD thesis, Stanford University, 1997.
[2] P. K. Dubey and M. J. Flynn. Optimal pipelining. Journal of Parallel and Distributed
Computing, 8:10–19, 1990.
[3] M. P. Farmwald. On the Design of High Performance Digital Arithmetic Units. PhD thesis, Stanford University, Aug. 1981.
[4] M. J. Flynn. Computer Architecture: Pipelined and Parallel Processor Design. Jones &
Bartlett, Boston, 1995. 788 pages.
[5] S. Fu, N. Quach, and M. Flynn. Architecture Evaluator’s Work Bench and Its Applica-
tion to Microprocessor Floating Point Units. Technical report CSL-TR-95-668, Stanford
University, June 1995.
[6] S. F. Oberman and M. J. Flynn. Design issues in division and other floating-point operations. To appear in IEEE Trans. Computers, 1997.
[8] N. T. Quach. Reducing the Latency of Floating-Point Arithmetic Operations. PhD thesis,
Stanford University, 1993.
[9] Semiconductor Industry Association. The National Technology Roadmap for Semicon-
ductors. San Jose, CA, 1994.
[10] S. Oberman, H. Al-Twaijry, and M. Flynn. The SNAP Project: Design of Floating Point Arithmetic Units. In Proceedings of Arith-13, Pacific Grove, July 1997.
[11] D. Wong, G. De Micheli, and M. Flynn. Designing high-performance digital circuits using
wave pipelining: Algorithms and practical experiences. IEEE Transactions on Computer
Aided Design of Integrated Circuits, 12(1):25–46, January 1993.