

Time and Area Optimization in Processor Architecture
M. J. Flynn
Stanford University
flynn@ee.stanford.edu

Abstract
For specified program behavior and clocking overhead, there is an optimum cycle time.
This can be improved somewhat by using wave pipelining, but program unpredictability ul-
timately limits performance by restricting both cycle time and instruction level parallelism.
Algorithm and application implementation should be based on an understanding of program behavior, CAD tools, and technology.
A system on a chip can be realized as die potential increases. This system die then consists of a collection of functional implementations on a single chip. These include the core processor, floating point units, signal processors, cache, message compression and encryption, etc.
Functional implementations involve selecting particular algorithms so that total appli-
cation execution time is minimized under the constraints of fixed die area.

Underlying all improvements in processor architecture are fundamental notions of the op-
timum use of time and space. In silicon CMOS technologies, the notion of optimum cost–
performance is translated into performance–area optimality, as chip area largely determines
chip cost (with the other major determinant being manufacturing volume).
In the process of optimizing time and space, we deal with a complex set of tradeoffs in-
volving a number of technology sub-disciplines. Time optimization requires control over VLSI
fab processing, state of the art CAD tools and an understanding of physical constraints such as
power and heat dissipation.
An understanding of algorithms allows the user to select the best design choices within
area–time restrictions. Selecting good algorithms also requires a broad understanding of avail-
able technology, behavior of user applications, and available software [4]. Finally, there is die
area itself. This is a primary determinant of marginal production cost. In an era of increasing
availability of chip area through improved technology (reduced feature size), the issue becomes
one of selecting functions that make optimum use of marginally available area. Table 1 shows
the projected performance area improvement for silicon die over the next ten years. Silicon die
that in 1997 have 10 million transistors will be replaced in ten years by die that have more than
25 times that number of active devices. Cycle time is also on a similar curve, so that gigahertz
or subgigahertz cycle times are expected in the same time frame.
 This work has been supported in part by the National Science Foundation under grant MIP-9313701.

Table 1: 1994 SIA Roadmap Summary [9].

| 1st DRAM Year     | 1992 | 1995 | 1998 | 2001 | 2004 | 2007 |
|-------------------|------|------|------|------|------|------|
| Feature Size (µm) | 0.50 | 0.35 | 0.25 | 0.18 | 0.13 | 0.10 |
| VDD (V)           | 5.0  | 3.3  | 2.5  | 1.8  | 1.5  | 1.2  |
| Trans/Chip        | 5M   | 10M  | 20M  | 50M  | 110M | 260M |
| Die Size (mm²)    | 210  | 250  | 300  | 360  | 430  | 520  |
| Freq (MHz)        | 225  | 300  | 450  | 600  | 800  | 1000 |
| DRAM Bits/Chip    | 16M  | 64M  | 256M | 1G   | 4G   | 16G  |
| SRAM Bits/Chip    | 4M   | 16M  | 64M  | 256M | 1G   | 4G   |

1 Time
It is generally, but not always, true that processors with faster cycle times provide better overall
performance. Consider the case of a simple pipelined processor.
For these simple pipelined processors, an estimate of the optimum number of stages in a pipeline has been determined [2] as S_opt:

$$S_{opt} = \sqrt{\frac{(1-b)\,T}{b\,c}}. \qquad (1)$$

The b term represents the fraction of instructions that cause a pipeline "break" of duration T. The ratio c/T is simply the clocking overhead (c) per cycle as a fraction of the total instruction execution latency (T).
If we make the cycle time too large and have too few stages in the pipeline, we sacrifice over-
all performance. On the other hand, if we make the cycle time too small, we incur increasing
amounts of clock overhead and we suffer from additional long pipeline breaks.

Since the cycle time, t, is
t = T/S + c,

the optimum t corresponds to the optimum S_opt. Adding additional stages beyond S_opt will
lower performance, largely due to the accumulation of the clock overhead c across the S stages.
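As a concrete illustration (a minimal Python sketch; the values of T, c, and b below are assumed for illustration, not measurements from the paper), equation (1) and the cycle-time relation above can be evaluated directly:

```python
import math

def optimal_pipeline(T, c, b):
    """Optimal stage count S_opt = sqrt((1 - b) * T / (b * c)) (equation 1)
    and the corresponding cycle time t = T / S_opt + c."""
    s_opt = math.sqrt((1 - b) * T / (b * c))
    return s_opt, T / s_opt + c

# Assumed, illustrative numbers: 20 ns total instruction latency,
# 1 ns clocking overhead, 20% of instructions cause a pipeline break.
s_opt, t = optimal_pipeline(T=20.0, c=1.0, b=0.2)
print(f"S_opt = {s_opt:.1f} stages, cycle time = {t:.2f} ns")
```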
It is possible to use latchless pipelines, or wave pipelining, which use inherent delay in
combinatorial logic to provide a storage element. This allows faster cycle times, as it does not
use any additional clock overhead [4, 11].
Suppose the stage of logic has maximum delay Pmax and minimum delay Pmin, with clock-
ing overhead c. Conventionally, the cycle time for a pipelined system would be defined as

t = Pmax + c.  (2)

It is easy to show that, if we can guarantee a delay of at least Pmin on every path from the source register to the destination register through this stage of logic, then we can clock this piece of logic at a much higher rate. This rate is bounded by

t ≥ Pmax − Pmin + c.
The advantage of improving clock rates by wave pipelining is that it does not add clock overhead to the overall execution time of instructions. In practice, it seems possible to achieve a cycle time of about one third of the conventionally expected cycle time by using wave pipelining.
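A small sketch of the two cycle-time bounds (the delay values below are assumed, illustrative numbers):

```python
def cycle_times(p_max, p_min, c):
    """Conventional pipelining pays the full worst-case path delay plus clock
    overhead; wave pipelining is limited only by the delay spread plus
    overhead, t >= (Pmax - Pmin) + c."""
    return p_max + c, (p_max - p_min) + c

conventional, wave = cycle_times(p_max=3.0, p_min=2.0, c=0.5)
print(f"conventional t = {conventional} ns, wave-pipelined t >= {wave} ns")
```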
In the end, performance enhancements by using long pipelines are limited by unpredictable
behavior of instructions (b) during program execution.
One can imagine an analogous equation to describe the selection of an optimum level of ILP, ILP_opt:

$$ILP_{opt} = \sqrt{\frac{1-d}{d}\cdot\frac{1}{O}}. \qquad (3)$$

This relationship is developed strictly as an analog to S_opt in equation 1. Here, O is the overhead of adding another instruction to an ILP machine as a fraction of the total instruction execution. It is assumed that O is constant for each additional instruction added. The parameter d is the fraction of occurrence of significantly disrupting events (akin to a misguessed branch) which cause the ILP machine to restart with a new instruction count. The tradeoff between VLIW and superscalar machines is simply the tradeoff between reducing the overhead with a VLIW machine and finding additional parallelism dynamically, and hence reducing the disruption, with a superscalar machine. As before, latency tolerance reduces d, the probability of a disruption, for all organizations and hence improves the available ILP.
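Evaluating equation (3) with assumed, illustrative values of d and O (not figures from the paper) gives a sense of the magnitudes involved:

```python
import math

def ilp_opt(d, O):
    """Analog of equation (1): ILP_opt = sqrt(((1 - d) / d) * (1 / O)), where
    d is the disruption probability and O is the per-instruction overhead as
    a fraction of total instruction execution."""
    return math.sqrt((1 - d) / d / O)

# Assumed values: 10% disruption rate, 5% overhead per added instruction.
print(f"ILP_opt = {ilp_opt(d=0.10, O=0.05):.1f}")
```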

For either processor approach (fast t or increased ILP), the performance is bounded by
implementation overhead and program behavior. We can use our understanding of algorithms
to advance both of these limits.

2 Algorithms [10]
Algorithms tie together technology, CAD, and software into an implementation which suits
a particular application. In this section, we illustrate several important aspects of algorithm
implementation in general, using advances in familiar floating point arithmetic operations as an
important exemplar.

2.1 Floating-point Add


Floating-point add basically consists of five steps:

- Exponent difference

- (Pre) shift for significand (fraction) alignment

- Significand add/subtract

- (Post) shift for significand realignment

- Round and align

We refer to such floating-point add implementations as "textbook" FP adders. The textbook adder represented the state of the art in floating point addition up until the early 1980's. It consisted of three additions (or subtractions) and two long shifts (usually implemented with a barrel
shifter).

[Figure 1: Three cycle pipelined adder with combined rounding, with separate FAR and CLOSE paths (exponent difference, prediction, swap, right shift, half add, left shift, priority encoding, compound adders, and a final result multiplexer). LOP is leading "1" prediction; PENC is priority encoding.]

Of course, it was always possible to build slower and less expensive implementations
simply by reusing adders or shifters, etc. But for the moment, we consider only fastest state
of the art implementations. In the early 1980’s, Farmwald [3] introduced what is called a “two
path” floating point adder. He does this (Figure 1) by recognizing that it is not possible for the
two long shifts to occur with the same set of operands. A long left shift (the second shift) can
only occur if two numbers of nearly the same magnitude are subtracted. If the exponents are far apart (the FAR path), then it is not possible for the result of the subtraction of two floating point numbers to require a long left shift. If the exponents are close, on the other hand, the numbers can cancel each other out and the leading digits in the result can be all zeros, thus requiring a long left shift to renormalize the result of the add. Thus, it is possible to build an adder that separates the two
adder cases. The first case computes the exponent difference and does the expected long right
(pre-)shift, putting the smaller number in proper position with respect to the larger number, an-
ticipating either an add or a subtract that cannot cause cancellation requiring a left shift. The
second case recognizes that the exponents are close (CLOSE path), thus they cannot require a
long preshifting of operands to align them before the addition or subtraction can occur. In this
path, it is possible to have cancellation under subtraction which requires the long left shift.
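The dispatch rule between the two paths can be sketched as follows (an illustrative simplification; the threshold of an exponent difference of at most one is the usual choice for the CLOSE path, though the text does not state it explicitly):

```python
def fp_add_path(exp_a, exp_b):
    """Two-path dispatch: the CLOSE path covers exponent differences of 0 or 1,
    where an effective subtraction may cancel leading bits and need a long
    left shift; all other cases take the FAR path, which needs at most a long
    right (pre-)shift for alignment."""
    return "CLOSE" if abs(exp_a - exp_b) <= 1 else "FAR"

print(fp_add_path(130, 130))  # CLOSE: cancellation possible under subtraction
print(fp_add_path(130, 118))  # FAR: alignment needs a long right shift only
```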
In the late 1980’s, the art was further advanced by the use of compound adders to replace
the final addition which is required for rounding [8]. A compound adder (ComAdd) is used which, given x and y as inputs, produces both x + y as the sum and x + y + 1 as an augmented sum. For most cases of rounding, one of these is sufficient to give the final correct result of a floating point addition. Control circuitry selects which of the possible precomputed outcomes is the correct one.
It is possible to take this one step further, using the variable latency algorithms recently developed by Oberman [7]. This approach recognizes that there are many common cases
where the result could be available early. For example, in certain cases, the two operands have
the same exponent (the CLOSE case) and the operation is addition, so only the simple left shift
could be required. In another case, if the exponents are quite far apart, the smaller number
simply slips out of range of the larger number and, depending on rounding, perhaps no addition
may be necessary. In each of these cases possible simplification can be detected early and func-
tional speedup is possible. For example, instead of a three-cycle floating addition, it might be
possible to do addition in one or two cycles for at least some of the operands, and three cycles
in the worst case. Oberman shows that a speedup of 1.33 is possible.
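A back-of-envelope check of that speedup (a sketch with an assumed two-class latency split; only the 1.33 figure comes from Oberman's results):

```python
def adder_speedup(frac_fast, fast_cycles, slow_cycles):
    """Speedup of a variable-latency adder over a fixed slow_cycles design,
    assuming a fraction frac_fast of additions hit an early-out case."""
    average = frac_fast * fast_cycles + (1 - frac_fast) * slow_cycles
    return slow_cycles / average

# If (hypothetically) 75% of adds complete in 2 cycles and the rest in 3,
# the average latency is 2.25 cycles, a speedup of about 1.33 over a
# fixed three-cycle adder.
print(f"speedup = {adder_speedup(0.75, 2, 3):.2f}")
```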
Lessons from FP Addition

1. By fully understanding the distribution of input data, we can achieve a significant speedup
in the overall execution of an algorithm.

2. To actually use this speedup, we must have a robust host machine that is able to use results
whose execution time is unpredictable.

2.2 Floating Point Multiply


The basic steps in multiplication are:

- Generate partial products

- Reduce (add) partial products to form for each bit (column) a sum and carry

- Final addition (and round) of column sum and carry

For an n × n multiplication, we need to generate up to n multiples of the multiplicand. These are called partial products, and they must be reduced, using counters or carry-save adders, to a final sum for each column and a carry into the next column. The last step simply adds the sums and carries together, producing a final result. As the exponent arithmetic can largely be done
concurrently, it is generally not in the critical path. Rounding is usually not in the critical path,
as the lower order round bits can be determined early in the final addition process.
Al-Twaijry [1] has completed an interesting study on the possible ways to implement float-
ing point multiplication. By doing a complete design and layout for almost 1,000 separate
multiply implementations, he examined the effect of various algorithmic and implementation
approaches on multiplier performance. Among the most interesting findings is that the algo-
rithmic selection is a function of certain technology parameters such as the feature size. At
large feature sizes such as 1 micron, the overall multiplier delay is determined primarily by gate
delays through the circuits themselves. As feature sizes shrink, the wires determine the overall
delay for the multiplier implementation. Thus, techniques that shorten the wires produce the
fastest multiplier. In a typical multiplier implementation, most of the circuitry and the wires are
in the partial product reduction step. In order to optimize delay through this tree of wires:
[Figure 2: Relative delay, algorithmic layout to binary tree, for non-Booth double precision and Booth-2 quad precision; the delay ratio is plotted against feature size (0.0 to 1.1 µm) for binary tree and automatic layout designs.]

1. The designer can use a Booth encoding of the multiplier; this reduces the number of partial products from n to n/2 by using shifted and signed versions of the multiplicand as partial products. By reducing the number of partial products, we shrink the size of the partial product reduction tree, even though we increase the circuitry necessary to generate the partial products. At large feature sizes, the generation delay dominates the overall multiply time and Booth encoding is not recommended. At small feature sizes, the wires in the partial product reduction tree dominate the delay, and hence Booth encoding (called Booth 2) is preferable. (A small recoding sketch appears below, after the discussion of Figure 2.)

2. In actually building the partial product reduction tree, Al-Twaijry shows that an algorithmically laid out tree of counters using CAD tools in a so-called Wallace tree is better than the more regular, custom-designed "4:2" compressor layout, which constructs a balanced binary tree of counters. The algorithmic layout of each column must be performed by a computer which actually places counters and wires to minimize the overall effect of wire delay.

Figure 2 shows the relative delay of the two approaches as a function of feature size. Notice
that as the feature size decreases, the algorithmic approach shows increased advantage.
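The recoding step mentioned in item 1 above can be sketched as follows (an illustrative Python model of radix-4 Booth recoding, not Al-Twaijry's layout; the widths and test values are arbitrary):

```python
def booth2_digits(x, n):
    """Radix-4 (Booth-2) recoding of an unsigned n-bit multiplier x into about
    n/2 signed digits in {-2, -1, 0, 1, 2}; each digit selects one shifted or
    negated multiple of the multiplicand as a partial product."""
    bits = [(x >> i) & 1 for i in range(n + 2)]          # zero-extend the top
    digits = []
    for i in range(0, n + 1, 2):                         # overlapping triplets
        lower = bits[i - 1] if i > 0 else 0
        digits.append(-2 * bits[i + 1] + bits[i] + lower)
    return digits

def booth2_multiply(a, b, n=8):
    """Reconstruct a * b as the sum of Booth partial products d_i * a * 4**i."""
    return sum(d * a * 4 ** i for i, d in enumerate(booth2_digits(b, n)))

assert booth2_multiply(13, 27) == 13 * 27
assert booth2_multiply(200, 251) == 200 * 251
print(booth2_digits(27, 8))   # 5 signed digits instead of 8 partial products
```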
Lessons from Floating Point Multiply

1. In most cases, there is no single optimum algorithm independent of technology. Specific technology parameters must be known to determine the optimum implementation of any application.
[Figure 3: Interlock distances (in instructions) for the benchmarks spice2g6, doduc, mdljdp2, tomcatv, ora, alvinn, ear, su2cor, hydro2d, nasa7, and fpppp; the non-optimized average is 3.34 instructions and the optimized average is 10.22.]

2. CAD tools play a vital role in allowing implementation optimization. These CAD tools
are generally at a higher level than simple schematic entry level routing and placement
that we have known and extensively used in the past.

2.3 Floating Point Division


Division is a time-consuming operation. Even under the best cases, division using multipliers
based upon Newton-Raphson methods requires between 8 and 10 cycles to perform a double-
precision floating point division. The question for the floating point division implementor is:
how fast does a divider really have to be? Is it possible to arrange the code so that the use of the
quotient does not occur until a number of instructions later?
If we can do this, we provide a form of latency tolerance for the division result. Oberman
[6] has performed an extensive study of the latency tolerance of floating point division. In his
results (Figure 3), he shows that for typical programs without any particular software support
for latency tolerance the quotient is required about three instructions after the divide instruc-
tion. Thus, a division that takes, say, 10 cycles will cause a 7-cycle disruption on a machine
that executes an instruction each cycle. He also shows that it is possible, by reordering code,
to create more robust latency tolerance on behalf of the floating point divide operation. This improves the separation between the divide operation and the instruction that uses the quotient to over 10 instructions. Thus, the same 10-cycle divide operation sees no delay in overall program execution, assuming the same single instruction per cycle execution rate. Of course, as processors become faster through ILP, it may no longer be acceptable to have a 10-cycle divider, because now 10 instructions may be executed in only 5 cycles; thus we are again faced with the problem of a long delay due to division. In any event, the software provides some relief, at least in the
shorter term.
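The effect of interlock distance on the visible divide latency can be sketched directly from the numbers above (an illustrative calculation, assuming any divider latency not covered by independent instructions appears as a stall):

```python
def divide_stall(div_latency, interlock_distance, ipc=1.0):
    """Cycles lost waiting for a quotient: independent instructions between
    the divide and its first use (executed at the given IPC) hide part of
    the divider latency."""
    return max(0.0, div_latency - interlock_distance / ipc)

print(divide_stall(10, 3))           # about a 7-cycle disruption, as in the text
print(divide_stall(10, 10))          # fully hidden after code reordering
print(divide_stall(10, 10, ipc=2))   # at an ILP of 2, the stall reappears
```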
Lessons from Floating Point Divide

1. Robust compilers and systems software can provide significant help in performance opti-
mization.

2. As processors exploit increasing levels of ILP in their search for speed, even optimized
software can provide only a certain amount of relief before simply faster implementations
are also required.

2.4 Comment
In our lessons, above, the three floating point operations span a spectrum of typical algorithms.
The actual details in optimizing the floating point operations are less important than the overall
lessons. The computer design process involves optimizing and selecting algorithms based on
full knowledge and characterization of technology, CAD tools, input data, systems software,
and host processor architecture.

3 Area
The last basic tradeoff exists in determining an optimum die size [4]. Die yield is a function of both the defect density, D, and the die area, A. In a simple Poisson model, the yield is

Yield = e^(−D·A).
A typical defect density is about one defect per square centimeter. Die cost is determined
by dividing the wafer cost by the number of good chips realized on the wafer. As die area
increases, the total number of chips on the wafer decreases, and the number of good chips decreases as the yield decreases. Since there is a fixed packaging and testing cost associated with
each die, making a die very small does not significantly decrease overall packaged die costs;
indeed, having too little functionality on the packaged die decreases the attractiveness of the
part to the buyer. This decreases the production run of the part and increases costs. Currently,
most processor dies have an area between 1.5 and 4.0 cm². Of course, what one actually achieves
on a die depends on the lithography, which in turn determines the number of devices that can
be placed on a given die area. The finer the lithography, the smaller the devices, and the more
devices that can be placed on the die. Lithography is usually expressed in terms of the minimum feature size, f (the size of the smallest realizable transistor); the minimum transistor takes roughly f² of area (about one square micron at a one-micron feature size).
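The cost consequence of the Poisson yield model can be made concrete with a small sketch (the wafer cost and diameter below are assumed, illustrative values; the one defect per square centimeter is from the text, and edge losses are ignored):

```python
import math

def cost_per_good_die(wafer_cost, wafer_area_cm2, die_area_cm2, D=1.0):
    """Yield = exp(-D * A); cost per good die is the wafer cost divided by
    the number of good dies on the wafer."""
    dies_per_wafer = wafer_area_cm2 / die_area_cm2
    yield_ = math.exp(-D * die_area_cm2)
    return wafer_cost / (dies_per_wafer * yield_)

wafer_area = math.pi * (20.0 / 2) ** 2          # assumed 20 cm wafer
for area in (1.5, 2.0, 4.0):                    # the 1.5-4.0 cm^2 range above
    print(f"{area} cm^2 die: ${cost_per_good_die(3000, wafer_area, area):.0f}")
```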
The area optimization problem is simply making the best use of available silicon. Each algorithm's state of the art can be encapsulated in an AT curve, as shown in Figure 4. In general, we have

AT^n = k,

where n usually ranges from 1 to 2, depending on the nature of communication within the implementation.
Table 2: FUPA components and results of recently announced processors.

| Processor      | Effective Latency (ns) | Normalized Area (mm²) | Normalized Effective Latency (ns) | FUPA (cm²·ns) |
|----------------|------------------------|-----------------------|-----------------------------------|---------------|
| DEC 21164      | 12.38                  | 77.51                 | 35.36                             | 27.41         |
| MIPS R10000    | 14.25                  | 70.08                 | 40.71                             | 28.53         |
| PA8000         | 24.44                  | 81.16                 | 48.89                             | 39.68         |
| Intel P6       | 28.25                  | 51.56                 | 80.71                             | 41.61         |
| SUN UltraSparc | 23.65                  | 133.43                | 50.32                             | 67.15         |
| AMD K5         | 66.00                  | 48.42                 | 188.57                            | 91.32         |

Note that k defines the "state of the art"—or "par" designs—at any particular moment. If a design uses points in the interior of the curve, the resultant design is inferior or "over par". Over time, k (and "par") decreases due to advances in technology and advances in algorithms themselves.
To study algorithms, the effect of technology can be largely eliminated by using technology-normalized area and time. Along this line, we developed a cost-performance metric for evaluat-
ing floating-point unit implementations [5]. The metric, Floating Point Unit Cost Performance
Analysis Metric (FUPA), incorporates five key aspects of VLSI systems design: latency, die
area, power, minimum feature size and profile of applications. FUPA utilizes technology pro-
jections based on scalable device models to identify the design/technology compatibility, and
allows designers to make high level tradeoffs in optimizing FPU designs.
We summarize the five steps of computing FUPA as:

1. Profile the applications to obtain dynamic floating point operation (add/sub, multiply, and
divide) distribution in the application,

2. Compute Effective Latency (EL) from the clock rate, FPU latencies, and the dynamic FP
operation distribution obtained in step 1,

3. Measure the die area (Area) of the FPU not including the register file,

4. Compute Normalized Effective Latency (NEL) and Normalized Area (NArea), removing
the feature size dependency, and

5. Compute FUPA, where

FUPA = (NEL · NArea) / 100,

with NArea in mm², giving FUPA in cm²·ns.
Example applications of FUPA to recently announced processors are shown in Table 2.
Processor FPU’s with the lowest FUPA define the state of the art.
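As a check on the formula in step 5 (a small sketch; the numbers are taken directly from the DEC 21164 row of Table 2):

```python
def fupa(nel_ns, narea_mm2):
    """FUPA = (NEL * NArea) / 100: normalized effective latency in ns times
    normalized area in mm^2, expressed in cm^2*ns (lower is better)."""
    return nel_ns * narea_mm2 / 100.0

print(f"DEC 21164 FUPA = {fupa(35.36, 77.51):.2f} cm^2*ns")  # matches 27.41 in Table 2
```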
The more the area, the lower the latency or number of cycles per instruction. The more
the area, the faster the algorithm. Of course, rather than a continuum of points across the AT spectrum, there are only discrete design points based upon validated known designs and design tradeoffs.
Suppose we could go to a web site that maintained the updated "state of the art" (AT^n = k)
for a particular algorithm. If the various selection sites supported signal processors, floating
point, multimedia functions, etc., the designer could select an optimum set of implementations
for a particular application (by allowing more or less area to a function) under a fixed total area constraint. This gives an optimum "system on a chip" or system die implementation.

[Figure 4: "Par" designs: the area–time (AT) plane showing the par curve, over-par designs in the interior, and validated scalable design points along the curve.]
The problem is to select among multiple design points across multiple applications to make
the best use of die area (Figure 4). To make system design on a chip work, one must have
several supporting tools.

1. A series of state-of-the-art scalable designs for each component of the system.

2. These designs must be scalable for various processes and for various process parameters,
such as feature size, wire pitch, number of wiring levels, etc.

3. Each of these design points must be validated across all data points.

4. The combination for each of the designs must have a ready interface to accommodate the
various interconnection busses.
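A minimal sketch of that selection problem follows (the design points, application weights, and area budget are invented for illustration; a real tool would draw validated points from the per-algorithm AT^n = k curves described above):

```python
from itertools import product

# Hypothetical (area mm^2, time ns) design points per functional unit.
DESIGN_POINTS = {
    "core_fpu": [(20, 12.0), (35, 8.0), (60, 6.0)],
    "dsp":      [(15, 30.0), (30, 18.0)],
    "cache":    [(40, 5.0), (80, 3.5)],
}
WEIGHTS = {"core_fpu": 0.5, "dsp": 0.2, "cache": 0.3}   # assumed application profile

def best_mix(area_budget_mm2):
    """Pick one design point per unit, minimizing the profile-weighted time
    subject to the total die-area budget (exhaustive search)."""
    units = list(DESIGN_POINTS)
    best = None
    for combo in product(*(DESIGN_POINTS[u] for u in units)):
        area = sum(a for a, _ in combo)
        if area > area_budget_mm2:
            continue
        time = sum(WEIGHTS[u] * t for u, (_, t) in zip(units, combo))
        if best is None or time < best[0]:
            best = (time, dict(zip(units, combo)))
    return best

print(best_mix(area_budget_mm2=130))
```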

4 Conclusions
Processor architecture is increasingly dependent on the understanding of technologies outside
that of traditional algorithm/logic implementation. The understanding of physical technology
and the availability of support tools in CAD and compilers are essential to advanced processor and functional unit implementations. Indeed, as the technology improves and a complete
“system on a chip” becomes available, it is even more important to have access to advanced,
optimized and constantly updated functional unit designs. Thus, the task of the computer architect is not only to know the basics of processor design, but also to be completely familiar with
all the supporting technology at both the processor and the system level.
References
[1] H. Al-Twaijry. Area and Performance Optimized CMOS Multipliers. PhD thesis, Stanford University, 1997.

[2] P. K. Dubey and M. J. Flynn. Optimal pipelining. Journal of Parallel and Distributed
Computing, 8:10–19, 1990.

[3] M. P. Farmwald. On the Design of High Performance Digital Arithmetic Units. PhD thesis, Stanford University, Aug. 1981.

[4] M. J. Flynn. Computer Architecture: Pipelined and Parallel Processor Design. Jones &
Bartlett, Boston, 1995. 788 pages.

[5] S. Fu, N. Quach, and M. Flynn. Architecture Evaluator’s Work Bench and Its Applica-
tion to Microprocessor Floating Point Units. Technical report CSL-TR-95-668, Stanford
University, June 1995.

[6] S. F. Oberman and M. J. Flynn, “Design issues in division and other floating-point opera-
tions.” To appear in IEEE Trans. Computers, 1997.

[7] S. F. Oberman and M. J. Flynn, "A variable latency pipelined floating-point adder." In Proc. Euro-Par'96, Springer LNCS, pp. 183–192, Aug. 1996.

[8] N. T. Quach. Reducing the Latency of Floating-Point Arithmetic Operations. PhD thesis,
Stanford University, 1993.

[9] Semiconductor Industry Association. The National Technology Roadmap for Semicon-
ductors. San Jose, CA, 1994.

[10] S. Oberman, H. Al-Twaijry, and M. Flynn. The SNAP Project: Design of Floating Point Arithmetic Units. In Proceedings of Arith-13, Pacific Grove, July 1997.

[11] D. Wong, G. De Micheli, and M. Flynn. Designing high-performance digital circuits using
wave pipelining: Algorithms and practical experiences. IEEE Transactions on Computer
Aided Design of Integrated Circuits, 12(1):25–46, January 1993.
