Part VII
Implementation Topics

Book parts and chapters:

Part I. Number Representation
  1. Numbers and Arithmetic
  2. Representing Signed Numbers
  3. Redundant Number Systems
  4. Residue Number Systems
Part II. Addition/Subtraction
  5. Basic Addition and Counting
  6. Carry-Lookahead Adders
  7. Variations in Fast Adders
  8. Multioperand Addition
Part III. Multiplication
  9. Basic Multiplication Schemes
  10. High-Radix Multipliers
  11. Tree and Array Multipliers
  12. Variations in Multipliers
Part IV. Division
  13. Basic Division Schemes
  14. High-Radix Dividers
  15. Variations in Dividers
  16. Division by Convergence
Part V. Real Arithmetic
  17. Floating-Point Representations
  18. Floating-Point Operations
  19. Errors and Error Control
  20. Precise and Certifiable Arithmetic
Part VI. Function Evaluation
  21. Square-Rooting Methods
  22. The CORDIC Algorithms
  23. Variations in Function Evaluation
  24. Arithmetic by Table Lookup
Part VII. Implementation Topics
  25. High-Throughput Arithmetic
  26. Low-Power Arithmetic
  27. Fault-Tolerant Arithmetic
  28. Reconfigurable Arithmetic
Appendix: Past, Present, and Future
May 2010 Computer Arithmetic, Implementation Topics Slide 2
About This Presentation
Edition   Released    Revised     Revised     Revised     Revised
First     Jan. 2000   Sep. 2001   Sep. 2003   Oct. 2005   Dec. 2007
Second    May 2010

This presentation is intended to support the use of the textbook
Computer Arithmetic: Algorithms and Hardware Designs (Oxford
U. Press, 2nd ed., 2010, ISBN 978-0-19-532848-6). It is updated
regularly by the author as part of his teaching of the graduate
course ECE 252B, Computer Arithmetic, at the University of
California, Santa Barbara. Instructors can use these slides freely
in classroom teaching and for other educational purposes.
Unauthorized uses are strictly prohibited. © Behrooz Parhami
VII Implementation Topics

Topics in This Part
Chapter 25 High-Throughput Arithmetic
Chapter 26 Low-Power Arithmetic
Chapter 27 Fault-Tolerant Arithmetic
Chapter 28 Reconfigurable Arithmetic

Sample advanced implementation methods and tradeoffs:
• Speed / latency is seldom the only concern
• We also care about throughput, size, power/energy
• Fault-induced errors are different from arithmetic errors
• Implementation on programmable logic devices
25 High-Throughput Arithmetic

Chapter Goals
Learn how to improve the performance of
an arithmetic unit via higher throughput
rather than reduced latency

Chapter Highlights
To improve overall performance, one must:
• Look beyond individual operations
• Trade off latency for throughput
For example, a multiply may take 20 cycles,
but a new one can begin every cycle
Data availability and hazards limit the depth
High-Throughput Arithmetic: Topics

Topics in This Chapter
25.1 Pipelining of Arithmetic Functions
25.2 Clock Rate and Throughput
25.3 The Earle Latch
25.4 Parallel and Digit-Serial Pipelines
25.5 On-Line or Digit-Pipelined Arithmetic
25.6 Systolic Arithmetic Units
25.1 Pipelining of Arithmetic Functions

Throughput: operations per unit time
Pipelining period: interval between applying successive inputs

Fig. 25.1 An arithmetic function unit and its σ-stage pipelined version:
input, interstage, and output latches divide the nonpipelined circuit of
latency t into σ stages, giving an overall latency of t + στ.

Latency, though a secondary consideration, is still important because:
a. Occasional need for doing single operations
b. Dependencies may lead to bubbles or even drainage

At times, a pipelined implementation may improve the latency of a multistep
computation and also reduce its cost; in this case, the advantage is obvious
Analysis of Pipelining Throughput

Consider a circuit with cost (gate count) g and latency t

Simplifying assumptions for our analysis:
1. Time overhead per stage is τ (latching delay)
2. Cost overhead per stage is γ (latching cost)
3. Function is divisible into σ equal stages for any σ

Then, for the pipelined implementation (Fig. 25.1):

Latency     T = t + στ
Throughput  R = 1/(T/σ) = 1/(t/σ + τ)
Cost        G = g + σγ

Throughput approaches its maximum of 1/τ for large σ
Analysis of Pipelining Cost-Effectiveness

Latency T = t + στ    Throughput R = 1/(t/σ + τ)    Cost G = g + σγ

Consider cost-effectiveness to be throughput per unit cost:
E = R/G = σ / [(t + στ)(g + σγ)]

To maximize E, compute dE/dσ and equate the numerator with 0:
tg – σ²τγ = 0  ⇒  σ_opt = √(tg/(τγ))

We see that the most cost-effective number of pipeline stages is:
Directly related to the latency and cost of the function;
it pays to have many stages if the function is very slow or complex
Inversely related to pipelining delay and cost overheads;
few stages are in order if pipelining overheads are fairly high
All in all, not a surprising result!
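The formulas above are easy to experiment with. The sketch below (plain Python, with purely hypothetical delay and cost figures) computes the metrics T, R, G, E for a chosen stage count and the cost-effectiveness-optimal σ_opt = √(tg/(τγ)):

```python
import math

def pipeline_metrics(t, g, tau, gamma, sigma):
    """Latency, throughput, cost, and cost-effectiveness of a
    sigma-stage pipeline: T = t + sigma*tau, R = 1/(t/sigma + tau),
    G = g + sigma*gamma, E = R/G."""
    T = t + sigma * tau
    R = 1.0 / (t / sigma + tau)
    G = g + sigma * gamma
    return T, R, G, R / G

def optimal_stages(t, g, tau, gamma):
    """Cost-effectiveness-optimal stage count: sigma_opt = sqrt(tg/(tau*gamma))."""
    return math.sqrt(t * g / (tau * gamma))

# Hypothetical numbers: latency t = 100 ns, cost g = 5000 gates,
# latch delay tau = 2 ns, latch cost gamma = 50 gates
sigma_opt = optimal_stages(100, 5000, 2, 50)   # sqrt(5000), about 70.7
```

As the slide predicts, E evaluated near σ_opt exceeds E for much smaller or much larger stage counts.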
25.2 Clock Rate and Throughput

Consider a σ-stage pipeline with stage delay t_stage (Fig. 25.1)

One set of inputs is applied to the pipeline at time t1
At time t1 + t_stage + τ, partial results are safely stored in latches
Apply the next set of inputs at time t2, satisfying t2 ≥ t1 + t_stage + τ

Therefore:
Clock period = Δt = t2 – t1 ≥ t_stage + τ
Throughput = 1/Clock period ≤ 1/(t_stage + τ)
The Effect of Clock Skew on Pipeline Throughput

Two implicit assumptions in deriving the throughput equation below:
One clock signal is distributed to all circuit elements
All latches are clocked at precisely the same time

Throughput = 1/Clock period ≤ 1/(t_stage + τ)   (Fig. 25.1)

Uncontrolled or random clock skew causes the clock signal to arrive at
point B before/after its arrival at point A

With proper design, we can place a bound ±ε on the uncontrolled
clock skew at the input and output latches of a pipeline stage

Then, the clock period is lower bounded as:
Clock period = Δt = t2 – t1 ≥ t_stage + τ + 2ε
Wave Pipelining: The Idea

The stage delay t_stage is really not a constant but varies from t_min to t_max
t_min represents fast paths (with fewer or faster gates)
t_max represents slow paths

Suppose that one set of inputs is applied at time t1
At time t1 + t_max + τ, the results are safely stored in latches
If the next inputs are applied at time t2, we must have:
t2 + t_min ≥ t1 + t_max + τ

This places a lower bound on the clock period:
Clock period = Δt = t2 – t1 ≥ t_max – t_min + τ

Thus, we can approach the maximum possible pipeline throughput of 1/τ
without necessarily requiring very small stage delay
All we need is a very small delay variance t_max – t_min

Two roads to higher pipeline throughput:
Reducing t_max
Increasing t_min
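The two clock-period bounds (conventional versus wave-pipelined, including the skew terms 2ε and 4ε derived on the neighboring slides) can be sketched as follows; the delay numbers are hypothetical:

```python
def ordinary_clock_period(t_stage, tau, eps=0.0):
    """Lower bound on the clock period of a conventional pipeline:
    Delta_t >= t_stage + tau + 2*eps (eps = uncontrolled skew bound)."""
    return t_stage + tau + 2 * eps

def wave_clock_period(t_max, t_min, tau, eps=0.0):
    """Lower bound for wave pipelining: Delta_t >= t_max - t_min + tau + 4*eps."""
    return t_max - t_min + tau + 4 * eps

# Hypothetical stage: path delays from 6 ns to 10 ns, latch overhead tau = 1 ns
# Ordinary pipelining must clock at the slow-path delay ...
slow = ordinary_clock_period(t_stage=10, tau=1)      # 11 ns
# ... while wave pipelining is limited only by the delay spread
fast = wave_clock_period(t_max=10, t_min=6, tau=1)   # 5 ns
```

Shrinking the spread t_max – t_min (rather than t_max itself) is what lets the wave-pipelined bound approach 1/τ.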
Visualizing Wave Pipelining

[Figure: successive computational wavefronts i (arriving at the stage
output) through i + 3 (not yet applied) traverse a single stage together;
each wavefront is bounded by its faster and slower signals, with an
allowance of t_max – t_min for latching, skew, etc.]

Fig. 25.2 Wave pipelining allows multiple computational
wavefronts to coexist in a single pipeline stage.
Another Visualization of Wave Pipelining

[Figure: logic depth versus time for (a) ordinary pipelining and
(b) wave pipelining; shaded transient regions and unshaded stationary
regions are marked on each clock cycle, with t_min and t_max indicated,
and controlled clock skew shown in part (b).]

Fig. 25.3 Alternate view of the throughput advantage of
wave pipelining over ordinary pipelining.
Difficulties in Applying Wave Pipelining

LAN and other high-speed links (figures rounded from Myrinet data
[Bode95]): sender and receiver connected by a 30 m Gb/s link (cable),
with 10 b characters

Gb/s throughput ⇒ character rate = 10⁸/s ⇒ clock cycle = 10 ns
In 10 ns, signals travel 1-1.5 m in cable (speed of light = 0.3 m/ns)
For a 30 m cable, 20-30 characters will be in flight at the same time

At the circuit and logic level (µm-mm distances, not m), there are still
problems to be worked out

For example, delay equalization to reduce t_max – t_min is nearly
impossible in CMOS technology:
CMOS 2-input NAND delay varies by a factor of 2 based on inputs
Biased CMOS (pseudo-CMOS) fares better, but has a power penalty
Controlled Clock Skew in Wave Pipelining

With wave pipelining, a new input enters the pipeline stage every
Δt time units and the stage latency is t_max + τ

Thus, for proper sampling of the results, clock application at the
output latch must be skewed by (t_max + τ) mod Δt

Example: t_max + τ = 12 ns; Δt = 5 ns
A clock skew of +2 ns is required at the stage output latches relative
to the input latches

In general, the value of t_max – t_min ≥ 0 may be different for each stage:
Δt ≥ max_{i=1 to σ} [t_max^(i) – t_min^(i) + τ]

The controlled clock skew at the output of stage i needs to be:
S^(i) = (Σ_{j=1 to i} [t_max^(j) – t_min^(j) + τ]) mod Δt
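The per-stage skew formula above can be sketched directly; the stage delays below are hypothetical:

```python
def controlled_skews(stage_delays, tau, dt):
    """Cumulative controlled clock skew at the output of each stage,
    S(i) = (sum over stages 1..i of (t_max - t_min + tau)) mod dt,
    following the formula on this slide.
    stage_delays: list of (t_max, t_min) pairs, one per stage."""
    skews, total = [], 0.0
    for t_max, t_min in stage_delays:
        total += t_max - t_min + tau
        skews.append(total % dt)
    return skews

# Hypothetical 3-stage pipeline, tau = 2 ns, clock period dt = 5 ns
# Per-stage terms (t_max - t_min + tau): 9, 6, 9 -> cumulative 9, 15, 24
skews = controlled_skews([(10, 3), (8, 4), (9, 2)], tau=2, dt=5)
# skews = [4.0, 0.0, 4.0] (mod 5)
```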
Random Clock Skew in Wave Pipelining

Clock period = Δt = t2 – t1 ≥ t_max – t_min + τ + 4ε

Reasons for the term 4ε:
Clocking of the first input set may lag by ε, while that of the second
set leads by ε (net difference = 2ε)
The reverse condition may exist at the output side

[Figure: logic depth versus time, with ±ε skew bands at the stage input
and output, giving a graphical justification of the term 4ε]

Uncontrolled skew has a larger effect on wave pipelining than on
standard pipelining, especially when viewed in relative terms
25.3 The Earle Latch

The Earle latch implements z = dC + dz + C̄z
(Fig. 25.4 Two-level AND-OR realization of the Earle latch, with data
input d, clock C, and output z)

Example: To latch d = vw + xy, substitute for d in the latch equation
z = dC + dz + C̄z
to get a combined "logic + latch" circuit implementing z = vw + xy:
z = (vw + xy)C + (vw + xy)z + C̄z
  = vwC + xyC + vwz + xyz + C̄z
(Fig. 25.5 Two-level AND-OR latched realization of the function z = vw + xy)

The Earle latch can thus be merged with a preceding 2-level AND-OR logic
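The merging step above is a purely Boolean identity, so it can be checked exhaustively. A minimal sketch (plain Python truth-table check, not a timing model):

```python
from itertools import product

def earle_latch(d, c, z):
    """Earle latch next-state equation: z' = dC + dz + C'z."""
    return (d and c) or (d and z) or ((not c) and z)

def merged(v, w, x, y, c, z):
    """The latch merged with the preceding logic d = vw + xy:
    z' = vwC + xyC + vwz + xyz + C'z."""
    return ((v and w and c) or (x and y and c) or
            (v and w and z) or (x and y and z) or ((not c) and z))

# Exhaustive check: folding the AND-OR logic into the latch changes nothing
for v, w, x, y, c, z in product([False, True], repeat=6):
    assert merged(v, w, x, y, c, z) == earle_latch((v and w) or (x and y), c, z)
```

This confirms the two-level merged form computes exactly "latch the value of vw + xy"; the hazard-free behavior under clock transitions is a separate timing property not captured here.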
Clocking Considerations for Earle Latches

We derived constraints on the maximum clock rate 1/Δt
Clock period Δt has two parts: clock high, and clock low
Δt = C_high + C_low

Consider a pipeline stage between Earle latches
C_high must satisfy the inequalities
3δ_max – δ_min + S_max(C, C̄) ≤ C_high ≤ 2δ_min + t_min
where δ_max and δ_min are the maximum and minimum gate delays, and
S_max(C, C̄) ≥ 0 is the maximum skew between C and C̄

Lower bound: the clock pulse must be wide enough to ensure that valid
data is stored in the output latch and to avoid a logic hazard should
C̄ slightly lead C
Upper bound: the clock must go low before the fastest signals from the
next input data set can affect the input z of the latch
25.4 Parallel and Digit-Serial Pipelines

[Figure: flow graph of an arithmetic expression over a, b, c, d, e, f —
involving (a + b), multiplications, e ÷ f, and square root — with latch
positions marked for a four-stage pipeline; the timing diagram shows the
pipelining period, the latency, and the time (from t = 0) at which the
output z becomes available.]

Fig. 25.6 Flow-graph representation of an arithmetic expression and
timing diagram for its evaluation with digit-parallel computation.
Feasibility of Bit-Level or Digit-Level Pipelining

Bit-serial addition and multiplication can be done LSB-first, but
division and square-rooting are MSB-first operations

Besides, division can't be done in pipelined bit-serial fashion,
because the MSB of the quotient q in general depends on all the
bits of the dividend and divisor

Example: Consider the decimal division .1234/.2469

.1xxx / .2xxx = .?xxx
.12xx / .24xx = .?xxx
.123x / .246x = .?xxx

Solution: Redundant number representation!
25.5 On-Line or Digit-Pipelined Arithmetic

[Figure: the same expression evaluated two ways. Digit-parallel: each
operation starts only when its full-precision inputs are ready, and the
output becomes available at the end. Digit-serial (on-line): each
operation has a small digit latency, results flow digit by digit, the
output completes earlier, and the next computation can begin sooner.]

Fig. 25.7 Digit-parallel versus digit-pipelined computation.
Digit-Pipelined Adders

Fig. 25.8 Digit-pipelined MSD-first carry-free addition.
Decimal example: .18 + .42 = .5 . . . (shaded boxes show the "unseen" or
unprocessed parts of the operands and the unknown part of the sum).
[Circuit: operand digits x_{-i} and y_{-i} enter an adder that produces
a transfer digit t_{-i+1} and an interim sum w_{-i}; latches hold the
interim sum so that the sum digit s_{-i+1} is formed from w_{-i+1} and
t_{-i+1}.]

Fig. 25.9 Digit-pipelined MSD-first limited-carry addition.
BSD example: .101 + .011 = .1 . . . (shaded boxes as above).
[Circuit: a first stage forms position sums p_{-i} and values e_{-i+2};
latched values p_{-i+1}, w_{-i+1}, w_{-i+2}, and t_{-i+2} combine to
produce the sum digit s_{-i+2}, one position later than in carry-free
addition.]
Digit-Pipelined Multiplier: Algorithm Visualization

Fig. 25.10 Digit-pipelined MSD-first multiplication process.
[BSD example with a = .101 and x = .111: shifted copies of .101 are
accumulated one per cycle; at each step, part of a and x has already
been processed, one digit position is being processed, and the rest is
not yet known. The product begins .0 . . .]
Digit-Pipelined Multiplier: BSD Implementation

Fig. 25.11 Digit-pipelined MSD-first BSD multiplier.
[Circuit: incoming digits a_{-i} and x_{-i} select, via multiplexers
with controls in {-1, 0, 1}, the partial multiplicand and partial
multiplier fed to a 3-operand carry-free adder; the adder's MSD becomes
the product digit p_{-i+2}, and the residual is shifted and retained.]
Digit-Pipelined Divider

Table 25.1 Example of digit-pipelined division showing
that three cycles of delay are necessary before quotient
digits can be output (radix = 4, digit set = [-2, 2])

Cycle  Dividend              Divisor                    q Range         q_{-1} Range
1      (.0 . . .)four        (.1 . . .)four             (-2/3, 2/3)     [-2, 2]
2      (.0 0 . . .)four      (.1 -2 . . .)four          (-2/4, 2/4)     [-2, 2]
3      (.0 0 1 . . .)four    (.1 -2 -2 . . .)four       (1/16, 5/16)    [0, 1]
4      (.0 0 1 0 . . .)four  (.1 -2 -2 -2 . . .)four    (10/64, 14/64)  1
Digit-Pipelined Square-Rooter

Table 25.2 Examples of digit-pipelined square-root computation showing
that 1-2 cycles of delay are necessary before root digits can be output
(radix = 10, digit set = [-6, 6], and radix = 2, digit set = [-1, 1])

Cycle  Radicand             q Range                q_{-1} Range
1      (.3 . . .)ten        (√(7/30), √(11/30))    [5, 6]
2      (.3 4 . . .)ten      (√(1/3), √(26/75))     6

1      (.0 . . .)two        (0, √(1/2))            [-2, 2]
2      (.0 1 . . .)two      (0, √(1/2))            [0, 1]
3      (.0 1 1 . . .)two    (1/2, √(1/2))          1
Digit-Pipelined Arithmetic: The Big Picture

Fig. 25.12 Conceptual view of on-line or digit-pipelined arithmetic:
an on-line arithmetic unit consumes the processed parts of its inputs,
holds a residual internally, and has already produced the leading digits
of its output, while the remaining input parts are still unprocessed.
25.6 Systolic Arithmetic Units

Systolic arrays: cellular circuits in which data elements
Enter at the boundaries
Advance from cell to cell in lock step
Are transformed in an incremental fashion
Leave from the boundaries

Systolic design mitigates the effect of signal propagation delay and
allows the use of very high clock rates

Fig. 25.13 High-level design of a systolic radix-4 digit-pipelined
multiplier: digits a_{-i} and x_{-i} enter a head cell followed by a
chain of identical cells, and product digits p_{-i+1} emerge from the
head cell.
Case Study: Systolic Programmable FIR Filters
Fig. 25.14 Conventional and systolic
realizations of a programmable FIR filter.
(a) Conventional: Broadcast control, broadcast data
(b) Systolic: Pipelined control, pipelined data
26 Low-Power Arithmetic

Chapter Goals
Learn how to improve the power efficiency
of arithmetic circuits by means of
algorithmic and logic design strategies

Chapter Highlights
Reduced power dissipation needed due to:
• Limited power source (portable, embedded)
• Difficulty of heat disposal
Algorithm- and logic-level methods: discussed
Technology and circuit methods: ignored here
Low-Power Arithmetic: Topics

Topics in This Chapter
26.1 The Need for Low-Power Design
26.2 Sources of Power Consumption
26.3 Reduction of Power Waste
26.4 Reduction of Activity
26.5 Transformations and Tradeoffs
26.6 New and Emerging Methods
26.1 The Need for Low-Power Design

Portable and wearable electronic devices
Lithium-ion batteries: 0.2 watt-hours per gram of weight
Practical battery weight < 500 g (< 50 g if wearable device)
Total power ~ 5-10 watts for a day's work between recharges
Modern high-performance microprocessors use 100s of watts
Power is proportional to die area × clock frequency
Cooling of micros difficult, but still manageable
Cooling of MPPs and server farms is a BIG challenge
New battery technologies cannot keep pace with demand
Demand for more speed and functionality (multimedia, etc.)
Processor Power Consumption Trends

Fig. 26.1 Power consumption trend in DSPs [Raba98]: power consumption
per MIPS (W) has fallen steadily over time. The factor-of-100
improvement per decade in energy efficiency has been maintained since
2000.
26.2 Sources of Power Consumption

Both average and peak power are important
Average power determines battery life or heat dissipation
Peak power impacts power distribution and signal integrity
Typically, low-power design aims at reducing both

Power dissipation in CMOS digital circuits
Static: leakage current in imperfect switches (< 10%)
Dynamic: due to (dis)charging of parasitic capacitance

P_avg ≈ α f C V²
where f = data rate (clock frequency), α = "activity",
C = capacitance, V = voltage
Power Reduction Strategies: The Big Picture

P_avg ≈ α f C V²

For a given data rate f, there are but 3 ways
to reduce the power requirements:
1. Using a lower supply voltage V
2. Reducing the parasitic capacitance C
3. Lowering the switching activity α

Example: A 32-bit off-chip bus operates at 5 V and 100 MHz and
drives a capacitance of 30 pF per bit. If random values were put
on the bus in every cycle, we would have α = 0.5. To account for
data correlation and idle bus cycles, assume α = 0.2. Then:

P_avg ≈ α f C V² = 0.2 × 10⁸ × (32 × 30 × 10⁻¹²) × 5² = 0.48 W
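The bus example reduces to a one-line formula; a minimal sketch reproducing the slide's 0.48 W figure:

```python
def dynamic_power(alpha, f_hz, c_farads, v_volts):
    """Average dynamic CMOS power: P_avg ~ alpha * f * C * V^2."""
    return alpha * f_hz * c_farads * v_volts ** 2

# The off-chip bus example from this slide:
# 32 bits x 30 pF/bit, 100 MHz, 5 V, activity alpha = 0.2
p = dynamic_power(alpha=0.2, f_hz=1e8, c_farads=32 * 30e-12, v_volts=5.0)
# p = 0.48 (watts)
```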
26.3 Reduction of Power Waste

Fig. 26.2 Saving power through clock gating: the latches feeding a
function unit are clocked only when an Enable signal is asserted.

Fig. 26.3 Saving power via guarded evaluation: input latches and a
multiplexer (with a Select signal) allow a function unit's inputs to be
held constant, keeping its logic idle, when its output is not needed.
Glitching and Its Impact on Power Waste

Fig. 26.4 Example of glitching in a ripple-carry adder: as the carry c_i
ripples from c_0 through the cells (carry propagation signal p_i), a sum
output s_i may change several times before settling, and each spurious
transition wastes power.
Array Multipliers with Lower Power Consumption

Fig. 26.5 An array multiplier with gated FA cells: operand bits a_0-a_4
and x_0-x_4 feed rows of full-adder cells whose sum and carry outputs
are gated, yielding the product bits p_0-p_9 with reduced spurious
activity.
26.4 Reduction of Activity

Fig. 26.6 Reduction of activity by precomputation: m of the n inputs
feed a precomputation circuit that decides, via a load-enable signal,
whether the main arithmetic circuit must process the remaining n - m
bits at all.

Fig. 26.7 Reduction of activity via Shannon expansion: based on the
select bit x_{n-1}, a multiplexer chooses between a function unit for
x_{n-1} = 0 and one for x_{n-1} = 1, each with n - 1 inputs, so only
one of the two units is active at a time.
26.5 Transformations and Tradeoffs

Fig. 26.8 Reduction of power via parallelism or pipelining.

Baseline arithmetic circuit:
Frequency = f, Capacitance = C, Voltage = V, Power = P

Parallel version (two circuit copies, input registers, output multiplexer):
Frequency = 0.5f, Capacitance = 2.2C, Voltage = 0.6V, Power = 0.396P

Pipelined version (two circuit stages with an interstage register):
Frequency = f, Capacitance = 1.2C, Voltage = 0.6V, Power = 0.432P
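The power figures in Fig. 26.8 follow directly from P ~ f C V² (activity assumed unchanged); a quick check:

```python
def power_ratio(f_scale, c_scale, v_scale):
    """Relative power after scaling frequency, capacitance, and voltage,
    from P ~ f * C * V^2 (switching activity assumed unchanged)."""
    return f_scale * c_scale * v_scale ** 2

# Figures from Fig. 26.8: both low-power variants run at 0.6V
parallel = power_ratio(0.5, 2.2, 0.6)   # 0.5 * 2.2 * 0.36 = 0.396 -> 0.396P
pipelined = power_ratio(1.0, 1.2, 0.6)  # 1.0 * 1.2 * 0.36 = 0.432 -> 0.432P
```

Both techniques buy the quadratic voltage saving by adding modest capacitance overhead while preserving the original data rate.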
Unrolling of Iterative Computations

Fig. 26.9 Realization of a first-order IIR filter.

(a) Simple: y^(i) = a y^(i-1) + b x^(i)

(b) Unrolled once: y^(i) = a² y^(i-2) + ab x^(i-1) + b x^(i),
with y^(i-1) also produced, so two outputs are computed per step
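Unrolling changes the schedule, not the results. A minimal sketch (plain Python, coefficients chosen so float arithmetic is exact) comparing the simple recurrence with the once-unrolled form:

```python
def iir_simple(a, b, xs, y0=0.0):
    """First-order IIR filter: y(i) = a*y(i-1) + b*x(i), one step per input."""
    ys, y = [], y0
    for x in xs:
        y = a * y + b * x
        ys.append(y)
    return ys

def iir_unrolled(a, b, xs, y0=0.0):
    """Same filter unrolled once: y(i) = a^2*y(i-2) + a*b*x(i-1) + b*x(i),
    producing two outputs per iteration (assumes an even number of inputs)."""
    ys, y_prev = [], y0
    for i in range(0, len(xs), 2):
        y_mid = a * y_prev + b * xs[i]                          # y(i-1)
        y_new = a * a * y_prev + a * b * xs[i] + b * xs[i + 1]  # y(i)
        ys += [y_mid, y_new]
        y_prev = y_new
    return ys

xs = [1.0, 2.0, -1.0, 0.5]
# Both formulations compute identical outputs
assert iir_simple(0.5, 1.0, xs) == iir_unrolled(0.5, 1.0, xs)
```

The unrolled loop body does twice the work per iteration, which (as in Fig. 26.8) allows a slower clock or lower voltage for the same throughput.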
Retiming for Power Efficiency

Fig. 26.10 Possible realizations of a fourth-order FIR filter.
(a) Original: products a x^(i), b x^(i-1), c x^(i-2), d x^(i-3) are
summed along a chain of adders to form y^(i)
(b) Retimed: registers are moved so that intermediate values u^(i),
v^(i-1), w^(i-1), u^(i-1) are latched between the multiply and add
operations, shortening the critical path
26.6 New and Emerging Methods

Fig. 26.11 Part of an asynchronous chain of computations: each
arithmetic circuit has a local control that exchanges "data ready" and
"release" handshake signals with its neighbors.

Dual-rail data encoding with transition signaling:
Two wires per signal
Transition on wire 0 (1) indicates the arrival of a 0 (1)
Dual-rail design does increase the wiring density, but it offers
the advantage of complete insensitivity to delays
The Ultimate in Low-Power Design

Fig. 26.12 Some reversible logic gates.
(a) Toffoli gate (TG): P = A, Q = B, R = AB ⊕ C
(b) Fredkin gate (FRG): P = A, Q = ĀB ⊕ AC, R = ĀC ⊕ AB
(c) Feynman gate (FG): P = A, Q = A ⊕ B
(d) Peres gate (PG): P = A, Q = A ⊕ B, R = AB ⊕ C

Fig. 26.13 Reversible binary full adder built of 5 Fredkin gates, with
a single Feynman gate used to fan out the input B. The label "G"
denotes "garbage." Inputs A, B, and C produce the sum s and carry-out
c_out along with garbage outputs.
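A gate is reversible exactly when its input-to-output map is a bijection, which is easy to check exhaustively. A minimal sketch (the Fredkin outputs are as reconstructed in the caption above):

```python
from itertools import product

# Truth functions for the reversible gates of Fig. 26.12 (bits as 0/1)
def toffoli(a, b, c):  return (a, b, (a & b) ^ c)
def fredkin(a, b, c):  return (a, ((1 - a) & b) ^ (a & c), ((1 - a) & c) ^ (a & b))
def feynman(a, b):     return (a, a ^ b)
def peres(a, b, c):    return (a, a ^ b, (a & b) ^ c)

# Reversibility: all 8 input patterns must map to 8 distinct outputs
for gate in (toffoli, fredkin, peres):
    outputs = {gate(a, b, c) for a, b, c in product((0, 1), repeat=3)}
    assert len(outputs) == 8

# Fredkin acts as a controlled swap: A = 1 swaps B and C, A = 0 passes them
assert fredkin(1, 0, 1) == (1, 1, 0)
assert fredkin(0, 0, 1) == (0, 0, 1)
```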
27 Fault-Tolerant Arithmetic

Chapter Goals
Learn about errors due to hardware faults
or hostile environmental conditions,
and how to deal with or circumvent them

Chapter Highlights
Modern components are very robust, but . . .
put millions / billions of them together
and something is bound to go wrong
Can arithmetic be protected via encoding?
Reliable circuits and robust algorithms
Fault-Tolerant Arithmetic: Topics

Topics in This Chapter
27.1 Faults, Errors, and Error Codes
27.2 Arithmetic Error-Detecting Codes
27.3 Arithmetic Error-Correcting Codes
27.4 Self-Checking Function Units
27.5 Algorithm-Based Fault Tolerance
27.6 Fault-Tolerant RNS Arithmetic
27.1 Faults, Errors, and Error Codes

Fig. 27.1 A common way of applying information coding techniques:
the input is encoded, then sent, stored, and sent again while protected
by the encoding; the manipulate step operates on unprotected data;
decoding then produces the output.
Fault Detection and Fault Masking

Fig. 27.2 Arithmetic fault detection or fault tolerance (masking) with
replicated units.
(a) Duplication and comparison: coded inputs are decoded and processed
by two ALUs; a comparator flags any mismatch ("mismatch detected"), and
a noncodeword among the coded outputs can also be detected
(b) Triplication and voting: three decode/ALU pairs feed a voter whose
output is re-encoded, so a single faulty unit is outvoted (masked)
Inadequacy of Standard Error Coding Methods

Fig. 27.3 How a single carry error can produce an arbitrary number of
bit errors (inversions):

Unsigned addition      0010 0111 0010 0001
                     + 0101 1000 1101 0011
                     ---------------------
Correct sum            0111 1111 1111 0100
Erroneous sum          1000 0000 0000 0100
                       (one stage generated an erroneous carry of 1)

The arithmetic weight of an error: the minimum number of signed powers
of 2 that must be added to the correct value to produce the erroneous
result

                    Example 1              Example 2
Correct value       0111 1111 1111 0100   1101 1111 1111 0100
Erroneous value     1000 0000 0000 0100   0110 0000 0000 0100
Difference (error)  16 = 2⁴               -32 752 = -2¹⁵ + 2⁴
Min-weight BSD      0000 0000 0001 0000   -1000 0000 0001 0000
Arithmetic weight   1                     2
Error type          Single, positive      Double, negative
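The arithmetic weight of an error value can be computed from its nonadjacent form (NAF), a minimal-weight binary signed-digit representation. A short sketch:

```python
def arithmetic_weight(e):
    """Minimum number of signed powers of 2 summing to the integer e,
    computed via the nonadjacent form (NAF), which is minimal-weight."""
    e, weight = abs(e), 0
    while e:
        if e & 1:
            weight += 1
            # choose signed digit +1 or -1 so the remaining value
            # becomes divisible by 4 (the NAF rule)
            e -= 1 if (e & 3) == 1 else -1
        e >>= 1
    return weight

# The two examples from Fig. 27.3
assert arithmetic_weight(16) == 1        # 16 = 2^4
assert arithmetic_weight(-32752) == 2    # -32752 = -2^15 + 2^4
```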
27.2 Arithmetic Error-Detecting Codes

Arithmetic error-detecting codes:
Are characterized by arithmetic weights of detectable errors
Allow direct arithmetic on coded operands

We will discuss two classes of arithmetic error-detecting codes,
both of which are based on a check modulus A (usually a small
odd number)

Product or AN codes
Represent the value N by the number AN

Residue (or inverse residue) codes
Represent the value N by the pair (N, C),
where C is N mod A or, for inverse residue codes, its complement
with respect to A
Product or AN Codes

For odd A, all weight-1 arithmetic errors are detected
Arithmetic errors of weight ≥ 2 may go undetected;
e.g., the error 32 736 = 2¹⁵ - 2⁵ is undetectable with A = 3, 11, or 31

Error detection: check divisibility by A
Encoding/decoding: multiply/divide by A
Arithmetic also requires multiplication and division by A

Product codes are nonseparate (nonseparable) codes
Data and redundant check info are intermixed
Low-Cost Product Codes

Low-cost product codes use low-cost check moduli of the form A = 2^a - 1

Multiplication by A = 2^a - 1: done by shift-subtract

Division by A = 2^a - 1: done a bits at a time as follows
Given y = (2^a - 1)x, find x by computing 2^a x - y

. . . xxxx 0000  -  . . . xxxx xxxx  =  . . . xxxx xxxx
Unknown 2^a x       Known (2^a - 1)x    Unknown x

Theorem 27.1: Any unidirectional error with arithmetic weight of at most
a - 1 is detectable by a low-cost product code based on A = 2^a - 1
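A tiny sketch of an AN code with the low-cost modulus A = 15 = 2⁴ - 1, showing that a weight-1 error is caught while a weight-2 error that happens to be a multiple of A slips through (the chosen data value is arbitrary):

```python
A = 15  # low-cost check modulus 2^4 - 1

def an_encode(n):
    """Encode value n as the product A*n (AN product code)."""
    return A * n

def an_check(coded):
    """A codeword must be divisible by A; a weight-1 error (a single
    added or subtracted power of 2) never is, since A is odd."""
    return coded % A == 0

word = an_encode(37)                     # 555
assert an_check(word)
assert not an_check(word + (1 << 7))     # weight-1 error: detected
assert an_check(word + 15)               # weight-2 error 16 - 1: missed
```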
Arithmetic on AN-Coded Operands

Add/subtract is done directly: Ax ± Ay = A(x ± y)

Direct multiplication results in: Aa × Ax = A²ax
The result must be corrected through division by A

For division, if z = qd + s, we have: Az = q(Ad) + As
Thus, q is unprotected
Possible cure: premultiply the dividend Az by A
The result will need correction

Square-rooting leads to a problem similar to division:
⌊√(A²x)⌋ = ⌊A√x⌋, which is not the same as A⌊√x⌋
Residue and Inverse Residue Codes

Represent N by the pair (N, C(N)), where C(N) = N mod A
Residue codes are separate (separable) codes
Separate data and check parts make decoding trivial
Encoding: given N, compute C(N) = N mod A
Low-cost residue codes use A = 2^a - 1

Arithmetic on residue-coded operands
Add/subtract: data and check parts are handled separately
(x, C(x)) ± (y, C(y)) = (x ± y, (C(x) ± C(y)) mod A)
Multiply
(a, C(a)) × (x, C(x)) = (a × x, (C(a) × C(x)) mod A)
Divide/square-root: difficult
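The separate handling of data and check parts is easy to sketch; here with the low-cost modulus A = 15 and arbitrary data values:

```python
A = 15  # low-cost check modulus 2^4 - 1

def res_encode(n):
    """Residue code: the pair (n, n mod A)."""
    return (n, n % A)

def res_add(u, v):
    """Data and check parts are processed independently."""
    return (u[0] + v[0], (u[1] + v[1]) % A)

def res_mul(u, v):
    return (u[0] * v[0], (u[1] * v[1]) % A)

def res_valid(u):
    """A mismatch between data and check parts signals an error."""
    return u[0] % A == u[1]

x, y = res_encode(123), res_encode(89)
s = res_add(x, y)
assert res_valid(s) and res_valid(res_mul(x, y))
bad = (s[0] ^ 4, s[1])        # flip one data bit, leave the check alone
assert not res_valid(bad)
```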
Arithmetic on Residue-Coded Operands

Add/subtract: data and check parts are handled separately
(x, C(x)) ± (y, C(y)) = (x ± y, (C(x) ± C(y)) mod A)
Multiply
(a, C(a)) × (x, C(x)) = (a × x, (C(a) × C(x)) mod A)
Divide/square-root: difficult

Fig. 27.4 Arithmetic processor with residue checking: a main arithmetic
processor computes z from x and y while a check processor computes C(z)
from C(x) and C(y); comparing z mod A with C(z) drives an error
indicator.
Example: Residue-Checked Adder

[Inputs (x, x mod A) and (y, y mod A): a main adder computes s = x + y
while a mod-A adder combines the check parts; a "find mod A" unit
reduces s, and a comparator raises an error signal if the two results
are not equal, yielding the coded output (s, s mod A).]
27.3 Arithmetic Error-Correcting Codes

Table 27.1 Error syndromes for weight-1 arithmetic errors in the
(7, 15) biresidue code

Positive   Syndrome          Negative   Syndrome
error      mod 7   mod 15    error      mod 7   mod 15
1          1       1         -1         6       14
2          2       2         -2         5       13
4          4       4         -4         3       11
8          1       8         -8         6       7
16         2       1         -16        5       14
32         4       2         -32       3       13
64         1       4         -64        6       11
128        2       8         -128       5       7
256        4       1         -256       3       14
512        1       2         -512       6       13
1024       2       4         -1024      5       11
2048       4       8         -2048      3       7
----------------------------------------------------
4096       1       1         -4096      6       14
8192       2       2         -8192      5       13
16,384     4       4         -16,384    3       11
32,768     1       8         -32,768    6       7

Because all the syndromes in the body of this table (errors ±1 through
±2048, spanning the code's 12 data bits) are different, any weight-1
arithmetic error is correctable by the (mod 7, mod 15) biresidue code;
the rows below the separator repeat earlier syndromes.
Properties of Biresidue Codes

A biresidue code with relatively prime low-cost check moduli
A = 2^a - 1 and B = 2^b - 1 supports a × b bits of data for
weight-1 error correction

Representational redundancy = (a + b)/(ab) = 1/a + 1/b
27.4 Self-Checking Function Units

Self-checking (SC) unit: any fault from a prescribed set either does not
affect the correct output (is masked) or leads to a noncodeword output
(is detected)

An invalid result is:
Detected immediately by a code checker, or
Propagated downstream by the next self-checking unit

To build SC units, we need SC code checkers that never
validate a noncodeword, even when they are faulty
Design of a Self-Checking Code Checker

Example: SC checker for an inverse residue code (N, C′(N))
N mod A should be the bitwise complement of C′(N)

Verifying that signal pairs (x_i, y_i) are all (1, 0) or (0, 1) is the
same as finding the AND of Boolean values encoded as:
1: (1, 0) or (0, 1)    0: (0, 0) or (1, 1)

Fig. 27.5 Two-input AND circuit, with 2-bit inputs (x_i, y_i) and
(x_j, y_j), for use in a self-checking code checker.
Case Study: Self-Checking Adders

Fig. 27.6 Self-checking adders with parity-encoded inputs and output.
(a) Parity prediction: an ordinary k-bit ALU is augmented with a parity
predictor and a parity generator; comparing the predicted and generated
parities yields an error signal alongside the parity-encoded output
(b) Parity preservation: parity-to-redundant (P/R) converters map the
k-bit parity-encoded inputs to (k + h)-bit redundant form, a redundant
parity-preserving ALU operates on them, and a redundant-to-parity (R/P)
converter produces the k-bit parity-encoded output
27.5 Algorithm-Based Fault Tolerance

Alternative strategy to error detection after each basic operation:
Accept that operations may yield incorrect results
Detect/correct errors at the data-structure or application level

Example: multiplication of matrices X and Y yielding P,
with row, column, and full checksum matrices (mod 8)

Fig. 27.7 A 3 × 3 matrix M with its row, column, and full
checksum matrices M_r, M_c, and M_f:

        2 1 6            2 1 6 | 1
M   =   5 3 4     M_r =  5 3 4 | 4
        3 2 7            3 2 7 | 4

        2 1 6            2 1 6 | 1
M_c =   5 3 4     M_f =  5 3 4 | 4
        3 2 7            3 2 7 | 4
        2 6 1            2 6 1 | 1
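The checksum construction, and the product property of Theorem 27.3 on the next slide (exact here because everything is reduced mod 8), can be sketched as follows; the second operand matrix Y is an arbitrary example:

```python
M = 8  # checksum modulus used in the slide's example

def row_checksum(a):
    """Append a mod-M checksum column to each row (M_r)."""
    return [row + [sum(row) % M] for row in a]

def col_checksum(a):
    """Append a mod-M checksum row (M_c)."""
    return a + [[sum(col) % M for col in zip(*a)]]

def full_checksum(a):
    """Both row and column checksums (M_f)."""
    return col_checksum(row_checksum(a))

def matmul_mod(x, y):
    return [[sum(xi * yi for xi, yi in zip(rx, cy)) % M
             for cy in zip(*y)] for rx in x]

X = [[2, 1, 6], [5, 3, 4], [3, 2, 7]]
Y = [[1, 0, 2], [3, 1, 1], [2, 2, 0]]

# Mod-M form of Theorem 27.3: X_c x Y_r equals the full checksum of X x Y
assert matmul_mod(col_checksum(X), row_checksum(Y)) == full_checksum(matmul_mod(X, Y))
```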
May 2010 Computer Arithmetic, Implementation Topics Slide 64
Properties of Checksum Matrices
Theorem 27.3: If P = X × Y , we have P
f
= X
c
× Y
r
(with floatingpoint values, the equalities are approximate)
Theorem 27.4: In a fullchecksum matrix, any single erroneous
element can be corrected and any three errors can be detected
Fig. 27.7 (repeated): M = [ 2 1 6 ; 5 3 4 ; 3 2 7 ], with M_r, M_c, and M_f obtained by appending the mod-8 check column (1, 4, 4), check row (2, 6, 1), and corner check 1.
27.6 Fault-Tolerant RNS Arithmetic
Residue number systems allow very elegant and effective error detection
and correction schemes by means of redundant residues (extra moduli)
Example: RNS(8 | 7 | 5 | 3), dynamic range M = 8 × 7 × 5 × 3 = 840;
redundant modulus: 11. Any error confined to a single residue is detectable.
Error detection (the redundant modulus must be the largest one, say m):
1. Use other residues to compute the residue of the number mod m
(this process is known as base extension)
2. Compare the computed and actual modm residues
The beauty of this method is that arithmetic algorithms are completely
unaffected; error detection is made possible by simply extending the
dynamic range of the RNS
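The detection procedure can be sketched with a Chinese-remainder reconstruction standing in for hardware base extension (the moduli come from the example above; the function names are ours):

```python
MODULI = (11, 8, 7, 5, 3)              # redundant modulus 11 listed first

def to_rns(x):
    return [x % m for m in MODULI]

def base_extend(residues, moduli):
    # Reconstruct the number from the nonredundant residues via the
    # Chinese remainder theorem (the software analog of base extension)
    M = 1
    for m in moduli:
        M *= m
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # modular inverse (Python 3.8+)
    return x % M

def residue_check(residues):
    # Step 1: base-extend from the other residues; step 2: compare the
    # computed and actual mod-11 residues
    return base_extend(residues[1:], MODULI[1:]) % MODULI[0] == residues[0]
```

Any single corrupted residue, redundant or not, makes the comparison fail, while valid representations pass unchanged.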
Example RNS with Two Redundant Residues
RNS(8  7  5  3), with redundant moduli 13 and 11
Representation of 25 = (12, 3, 1, 4, 0, 1)_RNS
Corrupted version = (12, 3, 1, 6, 0, 1)_RNS
Transform (–, –, 1, 6, 0, 1) to (5, 1, 1, 6, 0, 1) via base extension
Reconstructed number = (5, 1, 1, 6, 0, 1)_RNS
The difference between the first two components of the corrupted
and reconstructed numbers is (+7, +2)
This constitutes a syndrome, allowing us to correct the error
28 Reconfigurable Arithmetic
Chapter Goals
Examine arithmetic algorithms and designs appropriate for implementation on FPGAs (one-of-a-kind, low-volume, prototype systems)
Chapter Highlights
Suitable adder designs beyond ripple-carry
Design choices for multipliers and dividers
Table-based and “distributed” arithmetic
Techniques for function evaluation
Enhanced FPGAs and higher-level alternatives
Reconfigurable Arithmetic: Topics
Topics in This Chapter
28.1 Programmable Logic Devices
28.2 Adder Designs for FPGAs
28.3 Multiplier and Divider Designs
28.4 Tabular and Distributed Arithmetic
28.5 Function Evaluation on FPGAs
28.6 Beyond Fine-Grained Devices
28.1 Programmable Logic Devices
Fig. 28.1 Examples of programmable sequential logic: (a) portion of a PAL with storable output — 8-input ANDs feeding a D flip-flop and output muxes; (b) generic structure of an FPGA — an array of configurable logic blocks (CLBs, or LB clusters) surrounded by I/O blocks and programmable interconnections.
Programmability Mechanisms
Fig. 28.2 Some memory-controlled switches and interconnections in programmable logic devices: (a) tri-state buffer, (b) pass transistor, (c) multiplexer, each controlled by one or more memory cells.
Slide to be completed
Configurable Logic Blocks
Fig. 28.3 Structure of a simple logic block: a lookup table (LUT) with inputs x0–x4, carry-in and carry-out logic, a flip-flop, and output-select muxes producing the outputs y0–y2.
The Interconnect Fabric
Fig. 28.4 A possible arrangement of programmable interconnects between LBs or LB clusters: horizontal and vertical wiring channels joined by switch boxes.
Standard FPGA Design Flow
1. Specification: Creating the design files, typically via a
hardware description language such as Verilog, VHDL, or Abel
2. Synthesis: Converting the design files into interconnected
networks of gates and other standard logic circuit elements
3. Partitioning: Assigning the logic elements of stage 2 to specific
physical circuit elements that are capable of realizing them
4. Placement: Mapping of the physical circuit elements of stage 3
to specific physical locations of the target FPGA device
5. Routing: Mapping of the interconnections prescribed in stage 2
to specific physical wires on the target FPGA device
6. Configuration: Generation of the requisite bitstream file that
holds configuration bits for the target FPGA device
7. Programming: Uploading the bitstream file of stage 6 to
memory elements within the FPGA device
8. Verification: Ensuring the correctness of the final design, in
terms of both function and timing, via simulation and testing
28.2 Adder Designs for FPGAs
This slide to include a discussion of ripple-carry adders and built-in carry chains in FPGAs
Carry-Skip Addition
Fig. 28.5 Possible design of a 16-bit carry-skip adder on an FPGA: three ripple-carry adder blocks (5, 6, and 5 bits wide) between c_in and c_out, with skip logic selecting between a block's incoming carry and its ripple carry-out.
Slide to be completed
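A behavioral sketch of the idea (the block widths 5, 6, 5 are read off the figure; this models the skip decision, not the FPGA mapping):

```python
def carry_skip_add(a, b, cin=0, widths=(5, 6, 5)):
    # 16-bit carry-skip addition: ripple within each block, but when every
    # bit position of a block propagates (a_i XOR b_i = 1 throughout),
    # the block's carry-out is taken directly from its carry-in
    result, pos, c = 0, 0, cin
    for w in widths:
        mask = (1 << w) - 1
        ab, bb = (a >> pos) & mask, (b >> pos) & mask
        s = ab + bb + c
        c = c if (ab ^ bb) == mask else s >> w   # skip mux
        result |= (s & mask) << pos
        pos += w
    return result, c
```

The skip path never changes the numerical result — when all positions propagate, the ripple carry-out equals the carry-in anyway — it only shortens the worst-case carry path.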
Carry-Select Addition
Fig. 28.6 Possible design of a carry-select adder on an FPGA: blocks of 1, 2, 3, 4, and 6 bits, where each block beyond the first computes two candidate sums (for carry-in 0 and carry-in 1) and the incoming carry selects between them.
Slide to be completed
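A behavioral sketch (the block widths 1, 2, 3, 4, 6 are read off the figure; ordering them least significant first is our assumption):

```python
def carry_select_add(a, b, cin=0, widths=(1, 2, 3, 4, 6)):
    # Each block computes two candidate sums, for carry-in 0 and carry-in 1,
    # and the actual incoming carry selects between them (a 2-to-1 mux)
    result, pos, c = 0, 0, cin
    for w in widths:
        mask = (1 << w) - 1
        ab, bb = (a >> pos) & mask, (b >> pos) & mask
        s0, s1 = ab + bb, ab + bb + 1      # both candidates, in parallel
        s = s1 if c else s0
        result |= (s & mask) << pos
        c = s >> w
        pos += w
    return result, c
```

Widening the blocks toward the most significant end balances each block's internal delay against the arrival time of its select signal.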
28.3 Multiplier and Divider Designs
Fig. 28.7 Divide-and-conquer 4 × 4 multiplier design using 4-input lookup tables and ripple-carry adders: four LUT groups form the 2-bit × 2-bit partial products of (a3 a2), (a1 a0) with (x3 x2), (x1 x0), which a 4-bit adder and a 6-bit adder combine into the product p7 … p1 p0.
Slide to be completed
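The divide-and-conquer decomposition behind Fig. 28.7 can be sketched directly; a 16-entry table plays the role of each 4-input LUT group (names are ours):

```python
# Each 2-bit x 2-bit product is a table lookup (a 4-input LUT group)
LUT2x2 = {(u, v): u * v for u in range(4) for v in range(4)}

def mul4(a, x):
    # Split a = a3a2|a1a0 and x = x3x2|x1x0 into 2-bit halves, then
    # combine the four partial products with weights 1, 4, 4, 16
    aH, aL, xH, xL = a >> 2, a & 3, x >> 2, x & 3
    mid = LUT2x2[aH, xL] + LUT2x2[aL, xH]                       # 4-bit adder
    return LUT2x2[aL, xL] + (mid << 2) + (LUT2x2[aH, xH] << 4)  # 6-bit adder
```
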
Multiplication by Constants
Fig. 28.8 Multiplication of an 8-bit input by 13, using LUTs: one group of 8 LUTs forms 13x_L from the low 4 bits and another forms 13x_H from the high 4 bits; an 8-bit adder combines the two (with 13x_H shifted left 4 positions) to produce 13x.
Slide to be completed
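The same split, sketched in software (the table and helper names are ours):

```python
LUT13 = [13 * v for v in range(16)]    # contents held by each LUT group

def mul13(x):
    # 8-bit input: look up 13*x_H and 13*x_L, then combine with an adder,
    # weighting the high half by 16 (a left shift of 4 positions)
    return (LUT13[x >> 4] << 4) + LUT13[x & 0xF]
```
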
Division on FPGAs
Slide to be completed
28.4 Tabular and Distributed Arithmetic
Slide to be completed
Second-Order Digital Filter: Definition
Output from the current and two previous inputs and the two previous outputs (the a^(j)s and b^(j)s are constants):
y^(i) = a^(0) x^(i) + a^(1) x^(i–1) + a^(2) x^(i–2) – b^(1) y^(i–1) – b^(2) y^(i–2)
The filter receives inputs x^(1), x^(2), x^(3), …, x^(i), … and produces outputs y^(1), y^(2), y^(3), …, y^(i), … through an output latch.
Expand the equation for y^(i) in terms of the bits in operands x = (x_0 . x_–1 x_–2 … x_–l)_2's-compl and y = (y_0 . y_–1 y_–2 … y_–l)_2's-compl, where the summations below range from j = –l to j = –1:
y^(i) = a^(0) (–x_0^(i) + Σ 2^j x_j^(i)) + a^(1) (–x_0^(i–1) + Σ 2^j x_j^(i–1)) + a^(2) (–x_0^(i–2) + Σ 2^j x_j^(i–2))
      – b^(1) (–y_0^(i–1) + Σ 2^j y_j^(i–1)) – b^(2) (–y_0^(i–2) + Σ 2^j y_j^(i–2))
Define f(s, t, u, v, w) = a^(0) s + a^(1) t + a^(2) u – b^(1) v – b^(2) w; then
y^(i) = Σ 2^j f(x_j^(i), x_j^(i–1), x_j^(i–2), y_j^(i–1), y_j^(i–2)) – f(x_0^(i), x_0^(i–1), x_0^(i–2), y_0^(i–1), y_0^(i–2))
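The final equation is exactly what a 32-entry table implements. Below is a software sketch with made-up coefficient values (the a^(j) and b^(j) are left unspecified in the text) and l = 8 fractional bits; it tabulates f over the 32 one-bit patterns and evaluates one output digit-serially.

```python
A = (0.375, 0.25, 0.125)   # assumed a(0), a(1), a(2)
B = (0.5, 0.25)            # assumed b(1), b(2)
L = 8                      # fractional bits in the 2's-complement operands

def f(s, t, u, v, w):
    return A[0]*s + A[1]*t + A[2]*u - B[0]*v - B[1]*w

# 32-entry table, addressed by one bit from each of the five operands
TABLE = {(s, t, u, v, w): f(s, t, u, v, w)
         for s in (0, 1) for t in (0, 1) for u in (0, 1)
         for v in (0, 1) for w in (0, 1)}

def digits_2c(x):
    # 2's-complement digits (x0 . x-1 ... x-L) of x in [-1, 1), sign first
    n = round(x * (1 << L)) & ((1 << (L + 1)) - 1)
    return [(n >> (L - j)) & 1 for j in range(L + 1)]

def filter_output(xs, ys):
    # xs = (x(i), x(i-1), x(i-2)); ys = (y(i-1), y(i-2))
    d = [digits_2c(v) for v in xs + ys]
    y = -TABLE[tuple(op[0] for op in d)]         # sign-bit term, subtracted
    for j in range(1, L + 1):                    # remaining digit positions
        y += TABLE[tuple(op[j] for op in d)] * 2.0 ** (-j)
    return y
```

Because f is linear, the digit-serial sum reproduces the direct filter equation exactly for operands representable in L fractional bits.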
Second-Order Digital Filter: Bit-Serial Implementation
Fig. 28.9 Bit-serial tabular realization of a second-order filter: shift registers supply, LSB-first, one bit each of the ith input, the (i – 1)th and (i – 2)th inputs, and the (i – 1)th and (i – 2)th outputs; these five bits address a 32-entry lookup table (ROM) whose data output is added to, or for the sign bits subtracted from, a right-shifted running sum in an (m + 3)-bit register; the ith output being formed is copied into the output shift register at the end of the cycle and emerges LSB-first.
28.5 Function Evaluation on FPGAs
Fig. 28.10 The first four stages of an unrolled CORDIC processor: inputs x, y, z flow through a cascade in which each stage applies a pair of add/sub units to x and y (with hardwired shifts >> 1, >> 2, >> 3 in successive stages), a third add/sub unit updates z by the stage constant e^(0), e^(1), e^(2), e^(3), and sign logic chooses add vs. subtract; the stage outputs are x^(1), y^(1), z^(1) through x^(4), y^(4), z^(4).
Slide to be completed
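A functional sketch of the unrolled datapath (16 stages assumed — the figure shows only the first four; rotation mode, so x and y converge toward cos z and sin z for z within CORDIC's convergence range; the e^(i) constants are arctan 2^–i):

```python
from math import atan, cos

def cordic_rotation(z, stages=16):
    # Unrolled rotation-mode CORDIC: start at (K, 0) and drive the
    # residual angle z to 0 by adding/subtracting arctan(2^-i) per stage
    angles = [atan(2.0 ** -i) for i in range(stages)]
    K = 1.0
    for a in angles:
        K *= cos(a)                  # precomputed scale factor
    x, y = K, 0.0
    for i, a in enumerate(angles):
        d = 1 if z >= 0 else -1      # per-stage sign logic
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * a
    return x, y
```

In an FPGA each stage is a fixed block with a hardwired shift, so the loop above unrolls into a feed-forward pipeline.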
Implementing Convergence Schemes
Fig. 28.11 Generic convergence structure for function evaluation: a lookup table indexed by x supplies an initial approximation y^(0), which successive convergence steps refine into y^(1), y^(2), … ≈ f(x).
Slide to be completed
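A concrete instance of this structure, sketched for 1/x on [1, 2): the lookup table (4 index bits assumed) supplies y^(0), and each convergence step here is a Newton–Raphson iteration — one possible choice of refinement rule, not the only one:

```python
def reciprocal(x, table_bits=4, steps=3):
    # Seed from a small table indexed by the leading fraction bits of x,
    # then refine with quadratically converging steps y <- y * (2 - x*y)
    index = int((x - 1.0) * (1 << table_bits))
    y = 1.0 / (1.0 + (index + 0.5) / (1 << table_bits))   # table entry
    for _ in range(steps):
        y = y * (2.0 - x * y)       # one convergence step
    return y
```

Each step roughly squares the relative error, so a ~5-bit-accurate seed reaches double-precision-level accuracy in three steps.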
28.6 Beyond Fine-Grained Devices
Fig. 28.12 The design space for arithmetic-intensive applications, plotted as instruction depth vs. word width (each axis from 1 to 1024): FPGAs and DPGAs occupy the narrow-word, shallow-instruction-depth corner; general-purpose microprocessors, special-purpose processors, and MPPs lie toward greater instruction depths and word widths; a field-programmable arithmetic array ("our approach") sits between these extremes.
Slide to be completed
A Past, Present, and Future
Appendix Goals
Wrap things up, provide perspective, and
examine arithmetic in a few key systems
Appendix Highlights
One must look at arithmetic in context of
 Computational requirements
 Technological constraints
 Overall system design goals
 Past and future developments
Current trends and research directions?
Past, Present, and Future: Topics
Topics in This Chapter
A.1 Historical Perspective
A.2 Early High-Performance Computers
A.3 Deeply Pipelined Vector Machines
A.4 The DSP Revolution
A.5 Supercomputers on Our Laps
A.6 Trends, Outlook, and Resources
A.1 Historical Perspective
Babbage was aware of ideas such as carry-skip addition, carry-save addition, and restoring division (1848)
Modern reconstruction from Meccano parts:
http://www.meccano.us/difference_engines/
Computer Arithmetic in the 1940s
Machine arithmetic was crucial in proving the feasibility of computing with stored-program electronic devices
Hardware for addition/subtraction, use of complement representation, and shift-add multiplication and division algorithms were developed and fine-tuned
A seminal report by A.W. Burks, H.H. Goldstine, and J. von Neumann contained ideas on choice of number radix, carry propagation chains, fast multiplication via carry-save addition, and restoring division
State of computer arithmetic circa 1950:
Overview paper by R.F. Shaw [Shaw50]
Computer Arithmetic in the 1950s
The focus shifted from feasibility to algorithmic speedup methods and cost-effective hardware realizations
By the end of the decade, virtually all important fast-adder designs had already been published or were in the final phases of development
Residue arithmetic, SRT division, and CORDIC algorithms were proposed and implemented
Snapshot of the field circa 1960:
Overview paper by O.L. MacSorley [MacS61]
Computer Arithmetic in the 1960s
Tree multipliers, array multipliers, high-radix dividers, convergence division, and redundant signed-digit arithmetic were introduced
Implementation of floating-point arithmetic operations in hardware or firmware (in microprogram) became prevalent
Many innovative ideas originated from the design of early supercomputers, when the demand for high performance, along with the still high cost of hardware, led designers to novel and cost-effective solutions
Examples reflecting the state of the art near the end of this decade:
IBM’s System/360 Model 91 [Ande67]
Control Data Corporation’s CDC 6600 [Thor70]
Computer Arithmetic in the 1970s
Advent of microprocessors and vector supercomputers
Early LSI chips were quite limited in the number of transistors or logic
gates that they could accommodate
Microprogrammed control (with just a hardware adder) was a natural choice for single-chip processors, which were not yet expected to offer high performance
For high-end machines, pipelining methods were perfected to allow the throughput of arithmetic units to keep up with computational demand in vector supercomputers
Examples reflecting the state of the art near the end of this decade:
Cray 1 supercomputer and its successors
Computer Arithmetic in the 1980s
Spread of VLSI triggered a reconsideration of all arithmetic designs in light of interconnection cost and pin limitations
For example, carry-lookahead adders, thought to be ill-suited to VLSI, were shown to be efficiently realizable after suitable modifications; similar ideas were applied to more efficient VLSI tree and array multipliers
Bit-serial and on-line arithmetic were advanced to deal with severe pin limitations in VLSI packages
Arithmetic-intensive signal processing functions became driving forces for low-cost and/or high-performance embedded hardware: DSP chips
Computer Arithmetic in the 1990s
No breakthrough design concept
Demand for performance led to fine-tuning of arithmetic algorithms and implementations (many hybrid designs)
Increasing use of table lookup and tight integration of arithmetic unit and
other parts of the processor for maximum performance
Clock speeds reached and surpassed 100, 200, 300, 400, and 500 MHz
in rapid succession; pipelining used to ensure smooth flow of data
through the system
Examples reflecting the state of the art near the end of this decade:
Intel’s Pentium Pro (P6) → Pentium II
Several highend DSP chips
Computer Arithmetic in the 2000s
Continued refinement of many existing methods, particularly those
based on table lookup
New challenges posed by multi-GHz clock rates
Increased emphasis on low-power design
Work on, and approval of, the IEEE 754-2008 floating-point standard
Three parallel and interacting trends:
Availability of many millions of transistors on a single microchip
Energy requirements and heat dissipation of the said transistors
Shift of focus from scientific computations to media processing
A.2 Early High-Performance Computers
IBM System/360 Model 91 (360/91, for short; mid-1960s)
Part of a family of machines with the same instruction-set architecture
Had multiple function units and an elaborate scheduling and interlocking hardware algorithm to take advantage of them for high performance
Clock cycle = 20 ns (quite aggressive for its day)
Used 2 concurrently operating floating-point execution units performing:
Two-stage pipelined addition
12 × 56 pipelined partial-tree multiplication
Division by repeated multiplications (initial versions of the machine sometimes yielded an incorrect LSB for the quotient)
The IBM System/360 Model 91
Fig. A.1 Overall structure of the IBM System/360 Model 91 floating-point execution unit: a floating-point instruction unit (instruction buffers and controls) feeds, over register and buffer buses and a common bus, reservation stations (RS1–RS3 and RS1–RS2) for two execution units — an add unit with two pipelined adder stages, and a multiply/divide unit built from a multiply iteration unit and a propagate adder; operands arrive from storage into 6 buffers and 4 registers (also linked to the fixed-point unit), and results return on the result bus.
A.3 Deeply Pipelined Vector Machines
Cray X-MP/Model 24 (multiple-processor vector machine)
Had multiple function units, each of which could produce a new result on every clock tick, given suitably long vectors to process
Clock cycle = 9.5 ns
Used 5 integer/logic function units and 3 floating-point function units
Integer/logic units: add, shift, logical 1, logical 2, weight/parity
Floating-point units: add (6 stages), multiply (7 stages), reciprocal approximation (14 stages)
Pipeline setup and shutdown overheads
Vector unit not efficient for short vectors (break-even point)
Pipeline chaining
Cray X-MP Vector Computer
Fig. A.2 The vector section of one of the processors in the Cray X-MP/Model 24 supercomputer.
A.4 The DSP Revolution
Special-purpose DSPs have used a wide variety of unconventional arithmetic methods, e.g., RNS or logarithmic number representation
General-purpose DSPs provide an instruction set that is tuned to the needs of arithmetic-intensive signal processing applications
Example DSP instructions:
ADD A, B         { A + B → B }
SUB X, A         { A – X → A }
MPY ±X1, X0, B   { ±X1 × X0 → B }
MAC ±Y1, X1, A   { A ± Y1 × X1 → A }
AND X1, A        { A AND X1 → A }
General-purpose DSPs come in integer and floating-point varieties
Fixed-Point DSP Example
Fig. A.3 Block diagram of the data ALU in Motorola’s DSP56002 (fixed-point) processor: 24-bit X and Y buses load the input registers X1, X0, Y1, Y0; a multiplier and a 56-bit accumulator, rounding, and logical unit (with overflow indication) produce results for the 56-bit accumulator registers A (A2, A1, A0) and B (B2, B1, B0); a shifter and the A and B shifter/limiters condition the outputs.
Floating-Point DSP Example
Fig. A.4 Block diagram of the data ALU in Motorola’s DSP96002 (floating-point) processor: 32-bit X and Y buses pass through an I/O format converter into a register file (10 96-bit, or 10 64-bit, or 30 32-bit registers) that serves an add/subtract unit, a multiply unit, and a special function unit.
A.5 Supercomputers on Our Laps
In the beginning, there was the 8080; it led to the 80x86 = IA-32 ISA
Half a dozen or so pipeline stages: 80286, 80386, 80486, Pentium (80586)
With more advanced technology, a dozen or so pipeline stages and out-of-order instruction execution: Pentium Pro, Pentium II, Pentium III, Celeron
With still more advanced technology, two dozen or so pipeline stages: Pentium 4
In these machines, instructions are broken into micro-ops, which are executed out-of-order but retired in-order
Performance Trends in Intel Microprocessors
Processor performance (from KIPS through MIPS and GIPS toward TIPS) vs. calendar year, 1980–2010, for the 68000, 80286, 80386, 80486, 68040, Pentium, Pentium II, and R10000: an improvement of roughly ×1.6 per year.
Arithmetic in the Intel Pentium Pro Microprocessor
Fig. A.5 Key parts of the CPU in the Intel Pentium Pro (P6) microprocessor: a reservation station dispatches operations over several ports — port-0 units (integer execution unit 0 plus shift, FLP add, FLP mult, FLP div, and integer div), port-1 units (integer execution unit 1 and the jump execution unit), and ports 2–4 dedicated to memory access (address generation units, etc.) — with 80-bit buses carrying results to the reorder buffer and retirement register file.
A.6 Trends, Outlook, and Resources
Current focus areas in computer arithmetic
Design: Shift of attention from algorithms to optimizations at the
level of transistors and wires
This explains the proliferation of hybrid designs
Technology: Predominantly CMOS, with a phenomenal rate of
improvement in size/speed
New technologies cannot compete
Applications: Shift from high-speed or high-throughput designs
in mainframes to embedded systems requiring
Low cost
Low power
Ongoing Debates and New Paradigms
Renewed interest in bit- and digit-serial arithmetic as mechanisms to reduce the VLSI area and to improve packageability and testability
Synchronous vs asynchronous design (asynchrony has some overhead, but an equivalent overhead is being paid for clock distribution and/or systolization)
New design paradigms may alter the way in which we view or design arithmetic circuits:
Neuron-like computational elements
Optical computing (redundant representations)
Multivalued logic (match to high-radix arithmetic)
Configurable logic
Arithmetic complexity theory
Computer Arithmetic Timeline
Fig. A.6 Computer arithmetic through the decades (1940–2020) — key ideas, innovations, advancements, technology traits, and milestones:
40s: Binary format, carry chains, stored carry, carry-save multiplier, restoring divider
50s: Carry-lookahead adder, high-radix multiplier, SRT divider, CORDIC algorithms
60s: Tree/array multiplier, high-radix & convergence dividers, signed-digit, floating point
70s: Pipelined arithmetic, vector supercomputer, microprocessor, ARITH-2/3/4 symposia
80s: VLSI, embedded system, digital signal processor, on-line arithmetic, IEEE 754-1985
90s: CMOS dominance, circuit-level optimization, hybrid design, deep pipeline, table lookup
00s: Power/energy/heat reduction, media processing, FPGA-based arith., IEEE 754-2008
10s: Teraflops on laptop (or pocket device?), asynchronous design, nanodevice arithmetic
Snapshots: [Burk46], [Shaw50], [MacS61], [Ande67], [Thor70], [Garn76], [Swar90], [Swar09]
The End!
You’re up to date. Take my advice and try to keep it that way. It’ll be
tough to do; make no mistake about it. The phone will ring and it’ll be the
administrator –– talking about budgets. The doctors will come in, and
they’ll want this bit of information and that. Then you’ll get the salesman.
Until at the end of the day you’ll wonder what happened to it and what
you’ve accomplished; what you’ve achieved.
That’s the way the next day can go, and the next, and the one after that.
Until you find a year has slipped by, and another, and another. And then
suddenly, one day, you’ll find everything you knew is out of date. That’s
when it’s too late to change.
Listen to an old man who’s been through it all, who made the mistake of
falling behind. Don’t let it happen to you! Lock yourself in a closet if you
have to! Get away from the phone and the files and paper, and read and
learn and listen and keep up to date. Then they can never touch you,
never say, “He’s finished, all washed up; he belongs to yesterday.”
Arthur Hailey, The Final Diagnosis
VII Implementation Topics
Sample advanced implementation methods and tradeoffs:
• Speed / latency is seldom the only concern
• We also care about throughput, size, power/energy
• Fault-induced errors are different from arithmetic errors
• Implementation on programmable logic devices
Topics in This Part
Chapter 25 High-Throughput Arithmetic
Chapter 26 Low-Power Arithmetic
Chapter 27 Fault-Tolerant Arithmetic
Chapter 28 Reconfigurable Arithmetic
25 High-Throughput Arithmetic
Chapter Goals
Learn how to improve the performance of an arithmetic unit via higher throughput rather than reduced latency
Chapter Highlights
To improve overall performance, one must look beyond individual operations and trade off latency for throughput; for example, a multiply may take 20 cycles, but a new one can begin every cycle. Data availability and hazards limit the depth.
High-Throughput Arithmetic: Topics
Topics in This Chapter
25.1 Pipelining of Arithmetic Functions
25.2 Clock Rate and Throughput
25.3 The Earle Latch
25.4 Parallel and Digit-Serial Pipelines
25.5 On-Line or Digit-Pipelined Arithmetic
25.6 Systolic Arithmetic Units
25.1 Pipelining of Arithmetic Functions
Fig. 25.1 An arithmetic function unit and its σ-stage pipelined version (input, interstage, and output latches).
Throughput = operations per unit time; pipelining period = interval between applying successive inputs.
Latency, though a secondary consideration, is still important because: (a) there is occasional need for doing single operations; (b) dependencies may lead to bubbles or even drainage.
At times, pipelined implementation may improve the latency of a multistep computation and also reduce its cost; in this case, its advantage is obvious.
Analysis of Pipelining Throughput
Consider a circuit with cost (gate count) g and latency t. Simplifying assumptions for our analysis:
1. The function is divisible into σ equal stages for any σ
2. Time overhead per stage is τ (latching delay)
3. Cost overhead per stage is γ (latching cost)
Then, for the pipelined implementation (Fig. 25.1):
Latency       T = t + στ
Throughput    R = 1/(t/σ + τ)
Cost          G = g + σγ
Throughput approaches its maximum of 1/τ for large σ.
Analysis of Pipelining Cost-Effectiveness
With latency T = t + στ, throughput R = 1/(t/σ + τ), and cost G = g + σγ, consider cost-effectiveness to be throughput per unit cost:
E = R/G = σ / [(t + στ)(g + σγ)]
To maximize E, compute dE/dσ and equate the numerator with 0:
tg – σ²τγ = 0  ⟹  σ_opt = √(tg/(τγ))
All in all, the most cost-effective number of pipeline stages is directly related to the latency and cost of the function (it pays to have many stages if the function is very slow or complex) and inversely related to pipelining delay and cost overheads (few stages are in order if pipelining overheads are fairly high). Not a surprising result!
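The latency/throughput/cost formulas can be exercised numerically; the parameter values in the test below are arbitrary, and the function names are ours:

```python
def pipeline_metrics(t, g, tau, gamma, sigma):
    # Latency, throughput, and cost-effectiveness of a sigma-stage pipeline
    T = t + sigma * tau              # latency T = t + sigma*tau
    R = 1.0 / (t / sigma + tau)      # throughput R = 1/(t/sigma + tau)
    G = g + sigma * gamma            # cost G = g + sigma*gamma
    return T, R, R / G               # last value is E = R/G

def sigma_opt(t, g, tau, gamma):
    # Most cost-effective stage count: sigma_opt = sqrt(t*g / (tau*gamma))
    return (t * g / (tau * gamma)) ** 0.5
```

Evaluating E on either side of sigma_opt confirms that it is indeed the maximizer.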
25.2 Clock Rate and Throughput
Consider a σ-stage pipeline with stage delay t_stage. One set of inputs is applied to the pipeline at time t1. At time t1 + t_stage + τ, the partial results are safely stored in latches. Apply the next set of inputs at time t2 satisfying t2 ≥ t1 + t_stage + τ. Therefore (Fig. 25.1):
Clock period = Δt = t2 – t1 ≥ t_stage + τ
Throughput = 1/Clock period ≤ 1/(t_stage + τ)
The Effect of Clock Skew on Pipeline Throughput
Two implicit assumptions in deriving the throughput bound 1/(t_stage + τ):
One clock signal is distributed to all circuit elements
All latches are clocked at precisely the same time
Uncontrolled or random clock skew causes the clock signal to arrive at point B before/after its arrival at point A. With proper design, we can place a bound ±e on the uncontrolled clock skew at the input and output latches of a pipeline stage. Then, the clock period is lower bounded as:
Clock period = Δt = t2 – t1 ≥ t_stage + τ + 2e
Wave Pipelining: The Idea
The stage delay t_stage is really not a constant but varies from t_min to t_max; t_min represents fast paths (with fewer or faster gates) and t_max represents slow paths. Suppose that one set of inputs is applied at time t1. At time t1 + t_max + τ, the results are safely stored in latches. If the next inputs are applied at time t2, we must have:
t2 + t_min ≥ t1 + t_max + τ
This places a lower bound on the clock period:
Clock period = Δt = t2 – t1 ≥ t_max – t_min + τ
Thus, we can approach the maximum possible pipeline throughput of 1/τ without necessarily requiring very small stage delay; all we need is a very small delay variance t_max – t_min. Two roads to higher pipeline throughput: reducing t_max and increasing t_min.
Visualizing Wave Pipelining
Fig. 25.2 Wave pipelining allows multiple computational wavefronts to coexist in a single pipeline stage: wavefront i is arriving at the stage output while wavefronts i+1 and i+2 are still in flight and wavefront i+3 is not yet applied; the spread between each wavefront's faster and slower signals leaves an allowance, related to t_max – t_min, for latching, skew, etc.
Another Visualization of Wave Pipelining
Fig. 25.3 Alternate view of the throughput advantage of wave pipelining over ordinary pipelining: (a) ordinary pipelining with controlled clock skew, (b) wave pipelining — logic depth vs. time, with stationary regions (unshaded) and transient regions (shaded) within each clock cycle, for the same stage delay spread t_min to t_max.
Difficulties in Applying Wave Pipelining
LAN and other high-speed links (figures rounded from Myrinet data [Bode95]): a sender and receiver connected by a 30 m Gb/s link (cable) carrying 10 b characters. Gb/s throughput means a clock rate of 10^8, i.e., a clock cycle of 10 ns. In 10 ns, signals travel roughly 1–1.5 m in the cable (speed of light = 0.3 m/ns), so for a 30 m cable, 20–30 characters will be in flight at the same time.
At the circuit and logic level (mm distances, not m), there are still problems to be worked out. For example, delay equalization to reduce t_max – t_min is nearly impossible in CMOS technology: CMOS 2-input NAND delay varies by a factor of 2 based on inputs. Biased CMOS (pseudo-CMOS) fares better, but has a power penalty.
Controlled Clock Skew in Wave Pipelining
With wave pipelining, a new input enters the pipeline stage every Δt time units, and the stage latency is t_max + τ. Thus, for proper sampling of the results, clock application at the output latch must be skewed by (t_max + τ) mod Δt.
Example: t_max + τ = 12 ns and Δt = 5 ns. A clock skew of +2 ns is required at the stage output latches relative to the input latches.
In general, the value of t_max – t_min > 0 may be different for each stage:
Δt ≥ max over i = 1 to σ of [t_max^(i) – t_min^(i) + τ]
The controlled clock skew at the output of stage i needs to be:
S^(i) = { Σ from j = 1 to i of [t_max^(j) – t_min^(j) + τ] } mod Δt
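The controlled-skew computation can be sketched numerically; the per-stage (t_max, t_min) values below are arbitrary, and τ and Δt are passed in:

```python
def controlled_skews(stage_params, tau, dt):
    # stage_params: one (t_max, t_min) pair per stage.  Returns S(i) for
    # each stage i: the cumulative sum of (t_max - t_min + tau) over
    # stages 1..i, reduced modulo the clock period dt.
    skews, total = [], 0.0
    for t_max, t_min in stage_params:
        total += t_max - t_min + tau
        skews.append(total % dt)
    return skews
```

With a single stage where t_max – t_min + τ = 12 ns and Δt = 5 ns, the required output-latch skew is 2 ns, matching the worked example.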
Random Clock Skew in Wave Pipelining
With uncontrolled skew bounded by ±e at the latches,
Clock period = Δt = t2 – t1 ≥ t_max – t_min + τ + 4e
Reasons for the term 4e: clocking of the first input set may lag by e, while that of the second set leads by e (net difference = 2e); the reverse condition may exist at the output side (another 2e), as the graphical justification in the logic-depth vs. time picture shows.
Uncontrolled skew has a larger effect on wave pipelining than on standard pipelining, especially when viewed in relative terms.
25.3 The Earle Latch
Fig. 25.4 Two-level AND-OR realization of the Earle latch, with data input d, clock C and its complement C', and output z:
z = dC + dz + C'z
The Earle latch can be merged with a preceding 2-level AND-OR logic circuit. Example: to latch d = vw + xy, substitute for d in the latch equation z = dC + dz + C'z to get a combined “logic + latch” circuit implementing z = vw + xy:
z = (vw + xy)C + (vw + xy)z + C'z = vwC + xyC + vwz + xyz + C'z
Fig. 25.5 Two-level AND-OR latched realization of the function z = vw + xy.
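The merged-latch identity can be verified exhaustively. A sketch, with C' modeled as 1 – C and the function names ours:

```python
def earle_latch(d, C, z):
    # Hazard-free Earle latch equation: z_next = dC + dz + C'z
    return d & C | d & z | (1 - C) & z

def merged_latch(v, w, x, y, C, z):
    # The latch merged with the 2-level AND-OR logic for d = vw + xy:
    # z_next = vwC + xyC + vwz + xyz + C'z
    return (v & w & C) | (x & y & C) | (v & w & z) | (x & y & z) | ((1 - C) & z)
```

Checking all 64 input combinations shows the merged form always agrees with latching d = vw + xy: the new data is captured while C = 1, and the old z is held while C = 0.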
Clocking Considerations for Earle Latches
We derived constraints on the maximum clock rate 1/Δt. The clock period Δt has two parts, clock high and clock low:
Δt = C_high + C_low
Consider a pipeline stage between Earle latches. C_high must satisfy the inequalities
3d_max – d_min + S_max(C, C') ≤ C_high ≤ 2d_min + t_min
where d_max and d_min are maximum and minimum gate delays and S_max(C, C') ≥ 0 is the maximum skew between C and C'.
The lower bound: the clock pulse must be wide enough to ensure that valid data is stored in the output latch and to avoid a logic hazard should C' slightly lead C. The upper bound: the clock must go low before the fastest signals from the next input data set can affect the input z of the latch.
25.4 Parallel and Digit-Serial Pipelines

Example expression: z = (a + b) × c × d / (e × f)

Fig. 25.6 Flowgraph representation of an arithmetic expression and timing diagram for its evaluation with digit-parallel computation (latch positions in a four-stage pipeline; the timing diagram marks t = 0, the pipelining period, the latency, and the time at which the output becomes available).
Feasibility of Bit-Level or Digit-Level Pipelining

Bit-serial addition and multiplication can be done LSB-first, but division and square-rooting are MSB-first operations. Besides, division can't be done in pipelined bit-serial fashion, because the MSB of the quotient q in general depends on all the bits of the dividend and divisor.

Example: Consider the decimal division .1234/.2469 = .?xxx

.1xxx / .2xxx = .?xxx
.12xx / .24xx = .?xxx
.123x / .246x = .?xxx

Solution: Redundant number representation!
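A quick numerical probe makes the point concrete: with dividend .12xx and divisor .24xx, the still-unseen digits can push the first quotient digit either way. The extreme values below are hypothetical completions of the slide's operands, chosen only for illustration.

```python
# First quotient digit is NOT determined by the leading operand digits.
def first_digit(q):
    """First fractional digit of a quotient 0 < q < 1."""
    return int(q * 10)

lo = first_digit(0.1200 / 0.2499)  # smallest dividend, largest divisor
hi = first_digit(0.1299 / 0.2400)  # largest dividend, smallest divisor
print(lo, hi)  # the same prefix .12xx/.24xx can start with 4 or with 5
```

This is why conventional (non-redundant) quotient digits cannot be emitted MSD-first after seeing only operand prefixes.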
25.5 On-Line or Digit-Pipelined Arithmetic

Fig. 25.7 Digit-parallel versus digit-pipelined computation of (a + b) × c × d / (e × f): with digit-parallel operation, the output becomes available only after the full operation latencies; with digit-serial (on-line) operation, output digits begin to appear much earlier, and the next computation can begin before the current output is complete.
Digit-Pipelined Adders

Decimal example: .1 8 . . . + .4 2 . . . (shaded boxes show the "unseen" or unprocessed parts of the operands and the unknown part of the sum)

Fig. 25.8 Digit-pipelined MSD-first carry-free addition: position sum p–i and interim sum w–i are latched, producing transfer t–i+1 and sum digit s–i+1.

BSD example: .1 0 1 . . . + .0 1 1 . . . (shaded boxes again show the unseen parts of the operands and the unknown part of the sum)

Fig. 25.9 Digit-pipelined MSD-first limited-carry addition: position sum p–i+1 and e–i+1 feed the latched interim sum w–i+1, producing t–i+2 and s–i+2.
Digit-Pipelined Multiplier: Algorithm Visualization

Operands a = .1 0 1 . . . and x = .1 1 0 1 . . . ; shading distinguishes digits already processed, being processed, and not yet known.

Fig. 25.10 Digit-pipelined MSD-first multiplication process.
Digit-Pipelined Multiplier: BSD Implementation

Incoming digits a–i and x–i update a partial multiplicand and a partial multiplier; multiplexers select the {–1, 0, 1} multiples, and a 3-operand carry-free adder combines them with the shifted residual, releasing product digit p–i+2 from the MSD position.

Fig. 25.11 Digit-pipelined MSD-first BSD multiplier.
Digit-Pipelined Divider

Table 25.1 Example of digit-pipelined division showing that three cycles of delay are necessary before quotient digits can be output (radix = 4, digit set = [–2, 2])

––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Cycle  Dividend               Divisor                  q Range         q–1 Range
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
1      (.0 . . .)four         (.1 . . .)four           (–2/3, 2/3)     [–2, 2]
2      (.0 0 . . .)four       (.1 –2 . . .)four        (–2/4, 2/4)     [–2, 2]
3      (.0 0 1 . . .)four     (.1 –2 –2 . . .)four     (1/16, 5/16)    [0, 1]
4      (.0 0 1 0 . . .)four   (.1 –2 –2 –2 . . .)four  (10/64, 14/64)  1
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Digit-Pipelined Square-Rooter

Table 25.2 Examples of digit-pipelined square-root computation showing that one to two cycles of delay are necessary before root digits can be output (radix = 10, digit set = [–6, 6]; and radix = 2, digit set = [–1, 1])

–––––––––––––––––––––––––––––––––––––––––––––––––––––––
Cycle  Radicand           q Range         q–1 Range
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
1      (.3 . . .)ten      (7/30, 11/30)   [5, 6]
2      (.3 4 . . .)ten    (1/3, 26/75)    6
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
1      (.0 . . .)two      (0, 1/2)        [0, 1]
2      (.0 1 . . .)two    (0, 1/2)        [0, 1]
3      (.0 1 1 . . .)two  (. . . , 1/2)   1
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
Digit-Pipelined Arithmetic: The Big Picture

Fig. 25.12 Conceptual view of on-line or digit-pipelined arithmetic: an on-line arithmetic unit holds a residual; as unprocessed input parts arrive and join the processed input parts, output digits are produced on the fly.
25.6 Systolic Arithmetic Units

Systolic arrays: cellular circuits in which data elements
enter at the boundaries,
advance from cell to cell in lock step,
are transformed in an incremental fashion, and
leave from the boundaries.

Systolic design mitigates the effect of signal propagation delay and allows the use of very high clock rates.

Fig. 25.13 High-level design of a systolic radix-4 digit-pipelined multiplier: a head cell feeds a chain of cells, with inputs a–i and x–i and output p–i+1.
Case Study: Systolic Programmable FIR Filters

(a) Conventional: broadcast control, broadcast data
(b) Systolic: pipelined control, pipelined data

Fig. 25.14 Conventional and systolic realizations of a programmable FIR filter.
26 Low-Power Arithmetic

Chapter Goals
Learn how to improve the power efficiency of arithmetic circuits by means of algorithmic and logic design strategies

Chapter Highlights
Reduced power dissipation needed due to
  limited power sources (portable, embedded)
  difficulty of heat disposal
Algorithm- and logic-level methods: discussed
Technology and circuit methods: ignored here
Low-Power Arithmetic: Topics

Topics in This Chapter
26.1 The Need for Low-Power Design
26.2 Sources of Power Consumption
26.3 Reduction of Power Waste
26.4 Reduction of Activity
26.5 Transformations and Tradeoffs
26.6 New and Emerging Methods
but still manageable Cooling of MPPs and server farms is a BIG challenge New battery technologies cannot keep pace with demand Demand for more speed and functionality (multimedia.difficult. Implementation Topics Slide 33 .) May 2010 Computer Arithmetic.26.2 watthr per gram of weight Practical battery weight < 500 g (< 50 g if wearable device) Total power 510 watt for a day’s work between recharges Modern highperformance microprocessors use 100s watts Power is proportional to die area clock frequency Cooling of micros.1 The Need for LowPower Design Portable and wearable electronic devices Lithiumion batteries: 0. etc.
Processor Power Consumption Trends

The factor-of-100 improvement per decade in energy efficiency has been maintained since 2000.

Fig. 26.1 Power consumption trend in DSPs [Raba98] (power consumption per MIPS, in watts).
26.2 Sources of Power Consumption

Both average and peak power are important
Average power determines battery life or heat dissipation
Peak power impacts power distribution and signal integrity
Typically, low-power design aims at reducing both

Power dissipation in CMOS digital circuits
Static: leakage current in imperfect switches (< 10%)
Dynamic: due to (dis)charging of parasitic capacitance

Pavg ≈ a f C V²   (a = "activity"; f = data rate, i.e., clock frequency; C = capacitance; V² = square of the supply voltage)
Implementation Topics Slide 36 . Then: Pavg a f C V 2 = 0.5.2 108 (32 30 10–12) 52 = 0.Power Reduction Strategies: The Big Picture For a given data rate f. assume a = 0. If random values were put on the bus in every cycle. Lowering the switching activity a Pavg a f C V 2 Example: A 32bit offchip bus operates at 5 V and 100 MHz and drives a capacitance of 30 pF per bit. Reducing the parasitic capacitance C 3. To account for data correlation and idle bus cycles. we would have a = 0.2. Using a lower supply voltage V 2. there are but 3 ways to reduce the power requirements: 1.48 W May 2010 Computer Arithmetic.
26.3 Reduction of Power Waste

Fig. 26.2 Saving power through clock gating: a clock-enable signal gates the latches at the function unit's outputs.

Fig. 26.3 Saving power via guarded evaluation: a mux admits new inputs to the function unit only when its result is actually needed (Select / FU inputs / FU output).

Glitching and Its Impact on Power Waste

Fig. 26.4 Example of glitching in a ripple-carry adder: as the carry propagates from c0, intermediate values of pi and si cause spurious transitions that waste power.

Array Multipliers with Lower Power Consumption

Fig. 26.5 An array multiplier with gated FA cells (operands a4-a0 and x4-x0, product p9-p0, with sum and carry paths gated).
26.4 Reduction of Activity

Fig. 26.6 Reduction of activity by precomputation: m of the n inputs feed a precomputation block whose result controls the load enable of the remaining n – m inputs to the arithmetic circuit.

Fig. 26.7 Reduction of activity via Shannon expansion: separate function units evaluate the cases xn–1 = 0 and xn–1 = 1, with xn–1 selecting the output through a mux.
26.5 Transformations and Tradeoffs

Baseline arithmetic circuit (input and output registers, clock f):
frequency = f, capacitance = C, voltage = V, power = P

Parallelism (two circuit copies clocked at half rate, outputs merged by a mux):
frequency = 0.5f, capacitance = 2.2C, voltage = 0.6V, power = 0.396P

Pipelining (two circuit stages separated by a register):
frequency = f, capacitance = 1.2C, voltage = 0.6V, power = 0.432P

Fig. 26.8 Reduction of power via parallelism or pipelining.
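Because Pavg ≈ a·f·C·V², the power figures for the two transformations follow directly from the frequency, capacitance, and voltage ratios; a one-line check:

```python
# Relative power of a transformed design, from the ratios of its
# frequency, capacitance, and voltage to the baseline values.
def relative_power(f_ratio, c_ratio, v_ratio):
    return f_ratio * c_ratio * v_ratio ** 2

parallel = relative_power(0.5, 2.2, 0.6)   # two copies, half-rate clock
pipelined = relative_power(1.0, 1.2, 0.6)  # two stages, same clock
print(round(parallel, 3), round(pipelined, 3))  # 0.396 and 0.432
```

Both transformations win mainly through the V² term: the slower logic tolerated by each scheme permits the supply voltage to drop to 0.6V.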
Unrolling of Iterative Computations

First-order IIR filter: y(i) = a x(i) + b y(i–1)
Unrolled once: y(i) = a x(i) + ab x(i–1) + b² y(i–2)

Fig. 26.9 Realization of a first-order IIR filter: (a) simple; (b) unrolled once.
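A short sketch confirms that the once-unrolled recurrence produces the same output stream as the simple form (the coefficients and input samples below are arbitrary illustrative values).

```python
# First-order IIR filter y(i) = a*x(i) + b*y(i-1), versus its
# once-unrolled form y(i) = a*x(i) + a*b*x(i-1) + b^2 * y(i-2),
# whose feedback loop spans two cycles instead of one.
def iir_simple(xs, a, b):
    ys, prev = [], 0.0
    for x in xs:
        prev = a * x + b * prev
        ys.append(prev)
    return ys

def iir_unrolled(xs, a, b):
    ys = []
    for i, x in enumerate(xs):
        x1 = xs[i - 1] if i >= 1 else 0.0
        y2 = ys[i - 2] if i >= 2 else 0.0
        ys.append(a * x + a * b * x1 + b * b * y2)
    return ys

xs = [1.0, 2.0, -1.0, 0.5, 3.0]
ya, yb = iir_simple(xs, 0.5, 0.25), iir_unrolled(xs, 0.5, 0.25)
assert all(abs(u - v) < 1e-12 for u, v in zip(ya, yb))
print("unrolled IIR matches simple IIR")
```

The relaxed two-cycle feedback loop is what buys slack for voltage reduction in the low-power transformation.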
Retiming for Power Efficiency

Fourth-order FIR filter with coefficients a, b, c, d: (a) the original realization broadcasts x(i) to all taps and sums along a chain; (b) the retimed version inserts registers (holding u, v, w) to shorten the critical paths.

Fig. 26.10 Possible realizations of a fourth-order FIR filter: (a) original; (b) retimed.
26.6 New and Emerging Methods

Dual-rail data encoding with transition signaling:
Two wires per signal
Transition on wire 0 (1) indicates the arrival of 0 (1)

Dual-rail design does increase the wiring density, but it offers the advantage of complete insensitivity to delays.

Fig. 26.11 Part of an asynchronous chain of computations: arithmetic circuits with local control, linked by data-ready and release handshake signals.
The Ultimate in Low-Power Design

(a) Toffoli gate TG: P = A, Q = B, R = AB ⊕ C
(b) Fredkin gate FRG: P = A, Q = A′B ∨ AC, R = A′C ∨ AB
(c) Feynman gate FG: P = A, Q = A ⊕ B
(d) Peres gate PG: P = A, Q = A ⊕ B, R = AB ⊕ C

Fig. 26.12 Some reversible logic gates.

Fig. 26.13 Reversible binary full adder (inputs A, B, Cin = C; outputs Cout and sum s) built of 5 Fredkin gates, with a single Feynman gate used to fan out the input B. The label "G" denotes "garbage."
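Reversibility of these gates simply means that each one's truth table is a bijection on its inputs; a quick check, with the gate equations transcribed from the figure:

```python
from itertools import product

# Truth-table models of the reversible gates (^ is XOR).
def toffoli(a, b, c):
    return (a, b, (a & b) ^ c)

def fredkin(a, b, c):           # swaps b and c when a = 1
    return (a, c if a else b, b if a else c)

def feynman(a, b):
    return (a, a ^ b)

# A gate is reversible iff its mapping is a bijection: distinct inputs
# must map to distinct outputs.
for gate, n in [(toffoli, 3), (fredkin, 3), (feynman, 2)]:
    outputs = {gate(*bits) for bits in product([0, 1], repeat=n)}
    assert len(outputs) == 2 ** n, gate.__name__
print("all gates are bijective (reversible)")
```

Bijectivity is the property that, in principle, lets such gates compute without the thermodynamic cost of erasing information.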
27 Fault-Tolerant Arithmetic

Chapter Goals
Learn about errors due to hardware faults or hostile environmental conditions, and how to deal with or circumvent them

Chapter Highlights
Modern components are very robust, but . . . put millions / billions of them together and something is bound to go wrong
Can arithmetic be protected via encoding?
Reliable circuits and robust algorithms
Fault-Tolerant Arithmetic: Topics

Topics in This Chapter
27.1 Faults, Errors, and Error Codes
27.2 Arithmetic Error-Detecting Codes
27.3 Arithmetic Error-Correcting Codes
27.4 Self-Checking Function Units
27.5 Algorithm-Based Fault Tolerance
27.6 Fault-Tolerant RNS Arithmetic
27.1 Faults, Errors, and Error Codes

Fig. 27.1 A common way of applying information coding techniques: input → encode → send / store / manipulate / send → decode → output; only the middle portion is protected by encoding, while the encode and decode steps themselves remain unprotected.

Fault Detection and Fault Masking

(a) Duplication and comparison: two ALUs process the coded inputs; a comparator flags a mismatch, and decoders flag non-codewords
(b) Triplication and voting: three ALUs feed a voter, which masks a single faulty result before encoding the output

Fig. 27.2 Arithmetic fault detection or fault tolerance (masking) with replicated units.
Inadequacy of Standard Error Coding Methods

Unsigned addition:       0010 0111 0010 0001
                       + 0101 1000 1101 0011
                       –––––––––––––––––––––
Correct sum              0111 1111 1111 0100
Erroneous sum            1000 0000 0000 0100   (a stage generating an erroneous carry of 1)

Fig. 27.3 How a single carry error can produce an arbitrary number of bit-errors (inversions).

The arithmetic weight of an error: the minimum number of signed powers of 2 that must be added to the correct value to produce the erroneous result.

                     Example 1              Example 2
Correct value        0111 1111 1111 0100    1101 1111 1111 0100
Erroneous value      1000 0000 0000 0100    0110 0000 0000 0100
Difference (error)   16 = 2⁴                –32 752 = –2¹⁵ + 2⁴
Min-weight BSD       0000 0000 0001 0000    –1000 0000 0001 0000
Arithmetic weight    1                      2
Error type           Single, positive       Double, negative
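The arithmetic weight of any error value can be computed via its non-adjacent form (NAF), which is known to attain the minimum number of signed power-of-2 terms; a sketch:

```python
# Arithmetic weight of an error e: minimum number of signed powers of 2
# summing to e.  The NAF digit-recoding loop below attains this minimum.
def arithmetic_weight(e):
    e, w = abs(e), 0            # weight is symmetric in sign
    while e:
        if e & 1:
            # Choose digit +1 or -1 so the remaining value becomes
            # divisible by 4, keeping nonzero digits non-adjacent.
            e += 1 if (e & 2) else -1
            w += 1
        e >>= 1
    return w

print(arithmetic_weight(16))      # 16 = 2^4             -> weight 1
print(arithmetic_weight(-32752))  # -32752 = -2^15 + 2^4 -> weight 2
```

Both slide examples are reproduced: a single carry error has weight 1 even though it flips many bits, while the second error has weight 2.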
27.2 Arithmetic Error-Detecting Codes

Arithmetic error-detecting codes:
are characterized by the arithmetic weights of detectable errors, and
allow direct arithmetic on coded operands.

We will discuss two classes of arithmetic error-detecting codes, both of which are based on a check modulus A (usually a small odd number):

Product or AN codes: represent the value N by the number AN

Residue (or inverse residue) codes: represent the value N by the pair (N, C), where C is N mod A or (N – N mod A) mod A
Product or AN Codes

For odd A, all weight-1 arithmetic errors are detected
Arithmetic errors of weight 2 or more may go undetected; e.g., the error 32 736 = 2¹⁵ – 2⁵ is undetectable with A = 3, 11, or 31

Error detection: check divisibility by A
Encoding/decoding: multiply/divide by A
Arithmetic also requires multiplication and division by A

Product codes are nonseparate (nonseparable) codes: data and redundant check info are intermixed
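A minimal AN-code sketch, using the low-cost check modulus A = 15 as an assumed example: addition works directly on coded operands, and every weight-1 error (±2^j) is caught by a divisibility check, since no power of 2 is divisible by an odd A.

```python
# AN (product) code with check modulus A = 15.
A = 15

def encode(n):
    return A * n

def check(v):
    return v % A == 0      # codewords are exactly the multiples of A

def decode(v):
    return v // A

s = encode(23) + encode(19)        # add directly on coded operands
assert check(s) and decode(s) == 42

for j in range(16):                # inject every weight-1 error +/-2^j
    assert not check(s + (1 << j))
    assert not check(s - (1 << j))
print("all weight-1 errors detected")
```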
Low-Cost Product Codes

Low-cost product codes use low-cost check moduli of the form A = 2ᵃ – 1

Multiplication by A = 2ᵃ – 1: done by shift-subtract
Division by A = 2ᵃ – 1: done a bits at a time as follows
Given y = (2ᵃ – 1)x, find x by computing 2ᵃx – y

   . . . xxxx 0000    Unknown 2ᵃx
 – . . . xxxx xxxx    Known (2ᵃ – 1)x
 –––––––––––––––––
        . . . xxxx    Unknown x

Theorem 27.1: Any unidirectional error with arithmetic weight of at most a – 1 is detectable by a low-cost product code based on A = 2ᵃ – 1
Arithmetic on AN-Coded Operands

Add/subtract is done directly: Ax ± Ay = A(x ± y)

Direct multiplication results in: Aa × Ax = A²ax
The result must be corrected through division by A

For division, if z = qd + s, we have: Az = q(Ad) + As
Thus, q is unprotected
Possible cure: premultiply the dividend Az by A
The result will need correction

Square-rooting leads to a problem similar to division:
⌊√(A²x)⌋ is not in general the same as A⌊√x⌋
Residue and Inverse Residue Codes

Represent N by the pair (N, C(N)), where C(N) = N mod A

Residue codes are separate (separable) codes
Separate data and check parts make decoding trivial
Encoding: given N, compute C(N) = N mod A
Low-cost residue codes use A = 2ᵃ – 1

Arithmetic on Residue-Coded Operands

Add/subtract: data and check parts are handled separately
(x, C(x)) ± (y, C(y)) = (x ± y, (C(x) ± C(y)) mod A)
Multiply: (a, C(a)) × (x, C(x)) = (a × x, (C(a)C(x)) mod A)
Divide/square-root: difficult

Fig. 27.4 Arithmetic processor with residue checking: the main arithmetic processor computes z from x and y, a mod-A check processor computes C(z) from C(x) and C(y), and a comparator checks z mod A against C(z) to drive an error indicator.

Example: Residue-Checked Adder

Inputs x, x mod A and y, y mod A feed a main adder and a mod-A adder, respectively; the computed s mod A is compared with the check result, and "not equal" raises the error signal. Outputs: s, s mod A.
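A behavioral sketch of residue-checked addition with fault injection; A = 7 is an arbitrary low-cost modulus choice (2³ – 1), and the `fault` parameter stands in for a hardware error in the main adder.

```python
# Residue-checked adder: the check part is computed independently of the
# main sum and compared at the end.
A = 7

def residue_add(x, y, fault=0):
    s = x + y + fault                  # main adder (fault injects an error)
    c = ((x % A) + (y % A)) % A        # mod-A check processor
    ok = (s % A) == c                  # comparator: "not equal" -> error
    return s, c, ok

s, c, ok = residue_add(1234, 5678)
assert ok and s == 6912

_, _, ok = residue_add(1234, 5678, fault=64)   # 64 mod 7 = 1, not 0
assert not ok
print("residue check flags the faulty sum")
```

Note that a fault congruent to 0 mod A (e.g., an error of exactly ±7k) would escape detection, which is why detectable errors are characterized by their residues, not their magnitudes.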
27.3 Arithmetic Error-Correcting Codes

Table 27.1 Error syndromes for weight-1 arithmetic errors in the (7, 15) biresidue code

––––––––––––––––––––––––––––––––––––––––
Positive  Syndrome        Negative  Syndrome
error     mod 7  mod 15   error     mod 7  mod 15
––––––––––––––––––––––––––––––––––––––––
1         1      1        –1        6      14
2         2      2        –2        5      13
4         4      4        –4        3      11
8         1      8        –8        6      7
16        2      1        –16       5      14
32        4      2        –32       3      13
64        1      4        –64       6      11
128       2      8        –128      5      7
256       4      1        –256      3      14
512       1      2        –512      6      13
1024      2      4        –1024     5      11
2048      4      8        –2048     3      7
––––––––––––––––––––––––––––––––––––––––
4096      1      1        –4096     6      14
8192      2      2        –8192     5      13
16,384    4      4        –16,384   3      11
32,768    1      8        –32,768   6      7
––––––––––––––––––––––––––––––––––––––––

Because all the syndromes in this table are different (within the 12-bit data range supported by the code; from 4096 on, the syndromes repeat, as the lower part of the table shows), any weight-1 arithmetic error is correctable by the (mod 7, mod 15) biresidue code
Properties of Biresidue Codes

A biresidue code with relatively prime low-cost check moduli A = 2ᵃ – 1 and B = 2ᵇ – 1 supports a × b bits of data for weight-1 error correction

Representational redundancy = (a + b)/(ab) = 1/a + 1/b
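A sketch of biresidue correction matching Table 27.1, with A = 7 and B = 15 (a = 3, b = 4, so ab = 12 data bits); the data word 2947 is an arbitrary example value.

```python
# Biresidue code (7, 15): build the syndrome table for all weight-1
# errors on 12 data bits, then correct an injected error.
syndromes = {}
for j in range(12):
    for sign in (+1, -1):
        e = sign * (1 << j)
        syndromes[(e % 7, e % 15)] = e
assert len(syndromes) == 24       # all 24 syndromes are distinct

N = 2947                          # coded as (N, N mod 7, N mod 15)
data, c7, c15 = N, N % 7, N % 15

data += 1 << 9                    # inject the weight-1 error +2^9

syn = ((data - c7) % 7, (data - c15) % 15)
corrected = data - syndromes[syn]
assert corrected == N
print("syndrome", syn, "-> error corrected")
```

Distinctness of the 24 syndromes is exactly the property asserted by Table 27.1; a single lookup then recovers the error value and hence the datum.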
27.4 Self-Checking Function Units

Self-checking (SC) unit: any fault from a prescribed set does not affect the correct output (masked) or leads to a non-codeword output (detected)

An invalid result is:
detected immediately by a code checker, or
propagated downstream by the next self-checking unit

To build SC units, we need SC code checkers that never validate a non-codeword, even when they are faulty
Design of a Self-Checking Code Checker

Example: SC checker for the inverse residue code (N, C′(N))
N mod A should be the bitwise complement of C′(N)
Verifying that signal pairs (xi, yi) are all (1, 0) or (0, 1) is the same as finding the AND of Boolean values encoded as:
1: (1, 0) or (0, 1)
0: (0, 0) or (1, 1)

Fig. 27.5 Two-input AND circuit, with 2-bit inputs (xi, yi) and (xj, yj), for use in a self-checking code checker.
Case Study: Self-Checking Adders

(a) Parity prediction: a parity generator on the k-bit parity-encoded inputs and a parity predictor alongside an ordinary ALU; a mismatch signals an error
(b) Parity preservation: a redundant parity-preserving ALU operates directly on the parity-encoded operands
(c) Parity/redundant and redundant/parity code conversion: P/R converters (parity-to-redundant) feed a redundant ALU with (k + h)-bit values, and an R/P converter (redundant-to-parity) produces the parity-encoded output

Fig. 27.6 Self-checking adders with parity-encoded inputs and output.
27.5 Algorithm-Based Fault Tolerance

Alternative strategy to error detection after each basic operation:
Accept that operations may yield incorrect results
Detect/correct errors at the data-structure or application level

Example: multiplication of matrices X and Y yielding P

Row, column, and full checksum matrices Mr, Mc, and Mf (mod 8):

M  = | 2 1 6 |    Mr = | 2 1 6 1 |    Mc = | 2 1 6 |    Mf = | 2 1 6 1 |
     | 5 3 4 |         | 5 3 4 4 |         | 5 3 4 |         | 5 3 4 4 |
     | 3 2 7 |         | 3 2 7 4 |         | 3 2 7 |         | 3 2 7 4 |
                                           | 2 6 1 |         | 2 6 1 1 |

Fig. 27.7 A 3 × 3 matrix M with its row, column, and full checksum matrices (mod 8).
Properties of Checksum Matrices

Theorem 27.3: If P = X × Y, we have Pf = Xc × Yr (with floating-point values, the equalities are approximate)

Theorem 27.4: In a full-checksum matrix, any single erroneous element can be corrected and any three errors can be detected
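Theorem 27.3 can be checked directly for the slide's mod-8 arithmetic; X below is the matrix M of Fig. 27.7, while Y is an arbitrary illustrative choice, not from the text.

```python
# Algorithm-based fault tolerance for matrix multiplication (mod 8):
# instead of protecting each operation, compute Pf = Xc * Yr and check
# it against the full-checksum version of P = X * Y.
M = 8

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) % M
             for col in zip(*B)] for row in A]

def col_checksum(A):          # append a mod-M checksum row
    return A + [[sum(c) % M for c in zip(*A)]]

def row_checksum(A):          # append a mod-M checksum column
    return [row + [sum(row) % M] for row in A]

X = [[2, 1, 6], [5, 3, 4], [3, 2, 7]]
Y = [[1, 0, 2], [3, 1, 0], [2, 2, 1]]

P = matmul(X, Y)
Pf = matmul(col_checksum(X), row_checksum(Y))
assert Pf == row_checksum(col_checksum(P))    # Theorem 27.3, mod 8
print("full-checksum product verified")
```

A single corrupted element of Pf would then violate exactly one row checksum and one column checksum, and their intersection pinpoints it (Theorem 27.4).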
27.6 Fault-Tolerant RNS Arithmetic

Residue number systems allow very elegant and effective error detection and correction schemes by means of redundant residues (extra moduli)

Example: RNS(8 | 7 | 5 | 3), dynamic range M = 8 × 7 × 5 × 3 = 840; redundant modulus: 11. Any error confined to a single residue is detectable.

Error detection (the redundant modulus must be the largest one, say m):
1. Use the other residues to compute the residue of the number mod m (this process is known as base extension)
2. Compare the computed and actual mod-m residues

The beauty of this method is that arithmetic algorithms are completely unaffected; error detection is made possible by simply extending the dynamic range of the RNS
Example RNS with Two Redundant Residues

RNS(13 | 11 | 8 | 7 | 5 | 3), with redundant moduli 13 and 11

Representation of 25 = (12, 3, 1, 4, 0, 1)RNS
Corrupted version = (12, 3, 1, 6, 0, 1)RNS
Transform (–, –, 1, 6, 0, 1) to (5, 1, 1, 6, 0, 1) via base extension
Reconstructed number = (5, 1, 1, 6, 0, 1)RNS

The difference between the first two components of the corrupted and reconstructed numbers is (+7, +2)
This constitutes a syndrome, allowing us to correct the error
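The example can be replayed in software with a Chinese-remainder reconstruction standing in for base extension (hardware base extension is done residue-by-residue, but the computed values are the same).

```python
from math import prod

# RNS error detection/correction with redundant residues, following the
# slide's example: moduli (13, 11, 8, 7, 5, 3), redundant pair 13, 11.
MODULI = (13, 11, 8, 7, 5, 3)

def to_rns(n):
    return tuple(n % m for m in MODULI)

def crt(residues, moduli):
    """Chinese-remainder reconstruction (smallest non-negative value)."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # modular inverse (Python 3.8+)
    return x % M

assert to_rns(25) == (12, 3, 1, 4, 0, 1)

corrupted = (12, 3, 1, 6, 0, 1)              # mod-7 digit 4 -> 6
rebuilt = crt(corrupted[2:], MODULI[2:])     # "base extension" = 265
syndrome = ((corrupted[0] - rebuilt) % 13,
            (corrupted[1] - rebuilt) % 11)
print(rebuilt, syndrome)                     # 265 and the pair (7, 2)
```

The nonzero syndrome (7, 2) both flags the error and, via a lookup analogous to Table 27.1, identifies which residue to repair.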
28 Reconfigurable Arithmetic

Chapter Goals
Examine arithmetic algorithms and designs appropriate for implementation on FPGAs (one-of-a-kind, low-volume, prototype systems)

Chapter Highlights
Suitable adder designs beyond ripple-carry
Design choices for multipliers and dividers
Table-based and "distributed" arithmetic
Techniques for function evaluation
Enhanced FPGAs and higher-level alternatives

Reconfigurable Arithmetic: Topics

Topics in This Chapter
28.1 Programmable Logic Devices
28.2 Adder Designs for FPGAs
28.3 Multiplier and Divider Designs
28.4 Tabular and Distributed Arithmetic
28.5 Function Evaluation on FPGAs
28.6 Beyond Fine-Grained Devices
1 May 2010 Examples of programmable sequential logic. Implementation Topics Slide 69 . 28.28.1 Programmable Logic Devices 8input ANDs LB I/O block I/O blocks LB LB LB LB CLB LB LB LB LB LB CLB LB LB LB LB LB 01 Mu x C Q FF D LB LB CLB LB LB LB CLB LB Q Mu x 01 Logic block Configurable (or LB cluster) logic block Programmable Programmable interconnects connections (a) Portion of PAL with storable output (b) Generic structure of an FPGA Fig. Computer Arithmetic.
Programmability Mechanisms

Slide to be completed

(a) Tristate buffer, (b) pass transistor, and (c) multiplexer, each controlled by one or more memory cells

Fig. 28.2 Some memory-controlled switches and interconnections in programmable logic devices.
Configurable Logic Blocks

Inputs x0-x4 feed a logic element or LUT; the result can be latched in an FF, with muxes selecting the outputs y0-y2, and carry-in / carry-out signals linking adjacent blocks.

Fig. 28.3 Structure of a simple logic block.
The Interconnect Fabric

LBs or LB clusters connect to horizontal and vertical wiring channels, which are joined by switch boxes.

Fig. 28.4 A possible arrangement of programmable interconnects between LBs or LB clusters.
Standard FPGA Design Flow

1. Specification: Creating the design files, typically via a hardware description language such as Verilog, VHDL, or Abel
2. Synthesis: Converting the design files into interconnected networks of gates and other standard logic circuit elements
3. Partitioning: Assigning the logic elements of stage 2 to specific physical circuit elements that are capable of realizing them
4. Placement: Mapping of the physical circuit elements of stage 3 to specific physical locations of the target FPGA device
5. Routing: Mapping of the interconnections prescribed in stage 2 to specific physical wires on the target FPGA device
6. Configuration: Generation of the requisite bitstream file that holds configuration bits for the target FPGA device
7. Programming: Uploading the bitstream file of stage 6 to memory elements within the FPGA device
8. Verification: Ensuring the correctness of the final design, in terms of both function and timing, via simulation and testing
28.2 Adder Designs for FPGAs

This slide to include a discussion of ripple-carry adders and built-in carry chains in FPGAs
Carry-Skip Addition

Slide to be completed

A 16-bit adder built of 5-, 6-, and 5-bit blocks; skip logic allows cin to bypass the middle block, with a mux choosing between the block's carry-out and the skipped carry on the way to cout.

Fig. 28.5 Possible design of a 16-bit carry-skip adder on an FPGA.
Carry-Select Addition

Slide to be completed

Blocks of 1, 2, 3, 4, and 6 bits; each block beyond the first is computed for both carry-in values (0 and 1), with muxes selecting the correct version once the incoming carry is known.

Fig. 28.6 Possible design of a carry-select adder on an FPGA.
28.3 Multiplier and Divider Designs

Slide to be completed

Four sets of 4 LUTs form the partial products a1a0 × x1x0, a3a2 × x1x0, a1a0 × x3x2, and a3a2 × x3x2; a 4-bit and a 6-bit ripple-carry adder then combine them into the product bits p7-p0.

Fig. 28.7 Divide-and-conquer 4 × 4 multiplier design using 4-input lookup tables and ripple-carry adders.
Multiplication by Constants

Slide to be completed

The 8-bit input x is split into nibbles xH and xL; two sets of 8 LUTs produce 13xH and 13xL, and an 8-bit adder combines them into 13x.

Fig. 28.8 Multiplication of an 8-bit input by 13, using LUTs.
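A table-lookup sketch of the nibble decomposition in Fig. 28.8: one 16-entry table serves both nibbles, and the adder realizes 13x = 16·(13xH) + 13xL.

```python
# LUT-based multiplication by the constant 13 for an 8-bit input.
TABLE = [13 * n for n in range(16)]   # the 16-entry nibble table

def times13(x):
    """0 <= x < 256; combine the two nibble lookups with one addition."""
    lo, hi = x & 0xF, x >> 4
    return TABLE[lo] + (TABLE[hi] << 4)

assert all(times13(x) == 13 * x for x in range(256))
print("13x via nibble LUTs verified for all 8-bit inputs")
```

In an FPGA, each of the table's output bits maps to one 4-input LUT, which is why the figure shows two groups of 8 LUTs.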
Division on FPGAs

Slide to be completed
28.4 Tabular and Distributed Arithmetic

Slide to be completed
Second-Order Digital Filter: Definition

y(i) = a(0)x(i) + a(1)x(i–1) + a(2)x(i–2) – b(1)y(i–1) – b(2)y(i–2)

The a(j)s and b(j)s are constants; the filter uses the current and two previous inputs, along with the two previous outputs.

Expand the equation for y(i) in terms of the bits of the 2's-complement operands x = (x0.x–1x–2 . . . x–l) and y = (y0.y–1y–2 . . . y–l), where each summation below ranges from j = –l to j = –1:

y(i) = a(0)(–x0(i) + Σ 2ʲ xj(i)) + a(1)(–x0(i–1) + Σ 2ʲ xj(i–1)) + a(2)(–x0(i–2) + Σ 2ʲ xj(i–2))
     – b(1)(–y0(i–1) + Σ 2ʲ yj(i–1)) – b(2)(–y0(i–2) + Σ 2ʲ yj(i–2))

Define f(s, t, u, v, w) = a(0)s + a(1)t + a(2)u – b(1)v – b(2)w. Then:

y(i) = Σ 2ʲ f(xj(i), xj(i–1), xj(i–2), yj(i–1), yj(i–2)) – f(x0(i), x0(i–1), x0(i–2), y0(i–1), y0(i–2))
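A software model of this distributed-arithmetic rearrangement; the coefficient values and the l = 8 fractional bits below are illustrative assumptions, not values from the text.

```python
# Distributed arithmetic for the second-order filter: a 32-entry table
# holds f(s,t,u,v,w) = a0*s + a1*t + a2*u - b1*v - b2*w for all bit
# patterns, and y is accumulated one bit position at a time.
a0, a1, a2, b1, b2 = 0.75, -0.5, 0.25, 0.375, -0.125
L = 8                                    # fractional bits (l in the text)

F = [a0 * s + a1 * t + a2 * u - b1 * v - b2 * w
     for s in (0, 1) for t in (0, 1) for u in (0, 1)
     for v in (0, 1) for w in (0, 1)]

def bits(x):
    """2's-complement digits (x0, x_-1, ..., x_-L) of -1 <= x < 1."""
    n = round(x * (1 << L)) & ((1 << (L + 1)) - 1)
    return [(n >> (L - j)) & 1 for j in range(L + 1)]

def filt_direct(x0, x1, x2, y1, y2):
    return a0 * x0 + a1 * x1 + a2 * x2 - b1 * y1 - b2 * y2

def filt_da(x0, x1, x2, y1, y2):
    ops = [bits(v) for v in (x0, x1, x2, y1, y2)]
    idx = lambda j: sum(ops[k][j] << (4 - k) for k in range(5))
    y = -F[idx(0)]                       # sign bits carry weight -1
    for j in range(1, L + 1):
        y += F[idx(j)] * 2.0 ** -j       # fractional bits, weight 2^-j
    return y

args = (0.5, -0.25, 0.125, 0.375, -0.75)
assert abs(filt_da(*args) - filt_direct(*args)) < 1e-9
print("table-driven evaluation matches the direct formula")
```

Because f is linear in its five bit arguments, summing table entries with weights 2^-j (and subtracting the sign-bit entry) reproduces the full multiply-accumulate, which is the basis of the bit-serial realization that follows.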
Second-Order Digital Filter: Bit-Serial Implementation

Input bits xj arrive LSB-first into shift registers holding the ith, (i–1)th, and (i–2)th inputs, with the (i–1)th and (i–2)th outputs held in output shift registers. The five current bits address a 32-entry lookup table (ROM) holding the precomputed ±f values; the table output feeds an (m+3)-bit register through add/subtract and right-shift, forming the ith output y(i) LSB-first, which is copied into the output shift registers at the end of the cycle.

Fig. 28.9 Bit-serial tabular realization of a second-order filter.
28.5 Function Evaluation on FPGAs

Slide to be completed

Each stage i performs add/subtract on x and y with operands shifted by i positions (>> i) and updates z using the stage constant e(i); sign logic on z selects add or subtract in stages 1 through 4.

Fig. 28.10 The first four stages of an unrolled CORDIC processor.
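A floating-point model of the unrolled cascade in circular rotation mode; the 16-stage depth is an arbitrary choice, and the constants e(i) = arctan(2^-i) match the per-stage values suggested by the figure.

```python
import math

# Unrolled CORDIC rotator: stage i adds/subtracts shifted operands and
# updates z by e(i) = arctan(2^-i), the sign of z steering each stage.
N = 16
E = [math.atan(2.0 ** -i) for i in range(N)]
K = 1.0
for i in range(N):
    K *= math.cos(E[i])          # aggregate scale factor, ~0.6073

def cordic_rotate(angle):
    """Rotate (1, 0) by `angle`; returns ~(cos(angle), sin(angle))."""
    x, y, z = 1.0, 0.0, angle
    for i in range(N):
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * E[i]
    return x * K, y * K          # compensate the CORDIC gain

c, s = cordic_rotate(math.pi / 6)
print(round(c, 4), round(s, 4))  # approximately 0.866 and 0.5
```

In hardware, the shifts >> i are free wiring, each stage is just the three add/subtract units of the figure, and the scale factor K is folded into the initial x value.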
Implementing Convergence Schemes

Slide to be completed

A lookup table supplies an initial approximation y(0), which a chain of convergence steps refines into y(1), y(2), . . . , converging to f(x).

Fig. 28.11 Generic convergence structure for function evaluation.
28.6 Beyond Fine-Grained Devices

Slide to be completed

Fig. 28.12 The design space for arithmetic-intensive applications: instruction depth versus word width, spanning general-purpose microprocessors and MPPs (deep, narrow), special-purpose processors (deep, wide), DPGAs and FPGAs (shallow, narrow), and the field-programmable arithmetic array (shallow, wide; labeled "our approach").
A Past, Present, and Future

Appendix Goals
Wrap things up, provide perspective, and examine arithmetic in a few key systems

Appendix Highlights
One must look at arithmetic in the context of
  computational requirements
  technological constraints
  overall system design goals
  past and future developments
Current trends and research directions?
Past, Present, and Future: Topics

Topics in This Chapter
A.1 Historical Perspective
A.2 Early High-Performance Computers
A.3 Deeply Pipelined Vector Machines
A.4 The DSP Revolution
A.5 Supercomputers on Our Laps
A.6 Trends, Outlook, and Resources
A.1 Historical Perspective

Babbage was aware of ideas such as carry-skip addition, carry-save addition, and restoring division (1848)

Modern reconstruction from Meccano parts: http://www.meccano.us/difference_engines/
Computer Arithmetic in the 1940s

Machine arithmetic was crucial in proving the feasibility of computing with stored-program electronic devices

Hardware for addition/subtraction, use of complement representation, and shift-add multiplication and division algorithms were developed and fine-tuned

A seminal report by A.W. Burks, H.H. Goldstine, and J. von Neumann contained ideas on choice of number radix, carry propagation chains, fast multiplication via carry-save addition, and restoring division

State of computer arithmetic circa 1950: overview paper by R.F. Shaw [Shaw50]
Computer Arithmetic in the 1950s

The focus shifted from feasibility to algorithmic speedup methods and cost-effective hardware realizations

By the end of the decade, virtually all important fast-adder designs had already been published or were in the final phases of development

Residue arithmetic, SRT division, and CORDIC algorithms were proposed and implemented

Snapshot of the field circa 1960: overview paper by O.L. MacSorley [MacS61]
Computer Arithmetic in the 1960s

Tree multipliers, array multipliers, high-radix dividers, convergence division, and redundant signed-digit arithmetic were introduced

Implementation of floating-point arithmetic operations in hardware or firmware (in microprogram) became prevalent

Many innovative ideas originated from the design of early supercomputers, when the demand for high performance, along with the still high cost of hardware, led designers to novel and cost-effective solutions

Examples reflecting the state of the art near the end of this decade:
IBM's System/360 Model 91 [Ande67]
Control Data Corporation's CDC 6600 [Thor70]
Computer Arithmetic in the 1970s

Advent of microprocessors and vector supercomputers

Early LSI chips were quite limited in the number of transistors or logic gates that they could accommodate

Microprogrammed control (with just a hardware adder) was a natural choice for single-chip processors, which were not yet expected to offer high performance

For high-end machines, pipelining methods were perfected to allow the throughput of arithmetic units to keep up with computational demand in vector supercomputers

Examples reflecting the state of the art near the end of this decade: the Cray 1 supercomputer and its successors
Computer Arithmetic in the 1980s
Spread of VLSI triggered a reconsideration of all arithmetic designs in light of interconnection cost and pin limitations
For example, carry-lookahead adders, thought to be ill-suited to VLSI, were shown to be efficiently realizable after suitable modifications; similar ideas were applied to more efficient VLSI tree and array multipliers
Bit-serial and on-line arithmetic were advanced to deal with severe pin limitations in VLSI packages
Arithmetic-intensive signal processing functions became driving forces for low-cost and/or high-performance embedded hardware: DSP chips
Slide 93
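To make the carry-lookahead idea concrete, here is a small bit-level model in Python (an illustrative sketch, not a circuit description from the book): each carry is derived from generate and propagate signals, g_i = a_i AND b_i and p_i = a_i OR b_i, rather than from a slow ripple chain.

```python
def carry_lookahead_add(a, b, k=4):
    """k-bit carry-lookahead addition: carries are produced from
    generate (g) and propagate (p) signals instead of rippling."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(k)]    # g_i = a_i AND b_i
    p = [((a >> i) & 1) | ((b >> i) & 1) for i in range(k)]    # p_i = a_i OR b_i
    c = [0] * (k + 1)                                          # c_0 = carry-in = 0
    for i in range(k):
        # c_{i+1} = g_i OR (p_i AND c_i); in hardware this recurrence is
        # fully unrolled, so all carries appear after a few gate delays
        c[i + 1] = g[i] | (p[i] & c[i])
    s = 0
    for i in range(k):
        s |= (((a >> i) & 1) ^ ((b >> i) & 1) ^ c[i]) << i     # s_i = a_i XOR b_i XOR c_i
    return s | (c[k] << k)                                     # append carry-out
```

For instance, `carry_lookahead_add(9, 7)` returns 16, with the final 1 coming from the carry-out bit c_4.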
Computer Arithmetic in the 1990s
No breakthrough design concept
Demand for performance led to fine-tuning of arithmetic algorithms and implementations (many hybrid designs)
Increasing use of table lookup and tight integration of arithmetic unit and other parts of the processor for maximum performance
Clock speeds reached and surpassed 100, 200, 300, 400, and 500 MHz in rapid succession; pipelining was used to ensure smooth flow of data through the system
Examples reflecting the state of the art near the end of this decade: Intel's Pentium Pro (P6) and Pentium II; several high-end DSP chips
Slide 94
Computer Arithmetic in the 2000s
Three parallel and interacting trends:
   Availability of many millions of transistors on a single microchip
   Energy requirements and heat dissipation of the said transistors
   Shift of focus from scientific computations to media processing
Continued refinement of many existing methods, particularly those based on table lookup
New challenges posed by multi-GHz clock rates
Increased emphasis on low-power design
Work on, and approval of, the IEEE 754-2008 floating-point standard
Slide 95
A.2 Early High-Performance Computers
IBM System 360 Model 91 (360/91, for short; mid-1960s)
Part of a family of machines with the same instruction-set architecture
Had multiple function units and an elaborate scheduling and interlocking hardware algorithm to take advantage of them for high performance
Clock cycle = 20 ns (quite aggressive for its day)
Used 2 concurrently operating floating-point execution units performing:
   Two-stage pipelined addition
   12 x 56 pipelined partial-tree multiplication
   Division by repeated multiplications (initial versions of the machine sometimes yielded an incorrect LSB for the quotient)
Slide 96
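The "division by repeated multiplications" mentioned above can be sketched as a Goldschmidt-style convergence scheme: multiply dividend and divisor by the same sequence of factors so the divisor is driven to 1 and the dividend becomes the quotient. The Python fragment below is an illustrative model, not the Model 91's exact algorithm, and assumes the divisor has been pre-scaled into (1/2, 1]:

```python
def divide_by_repeated_multiplication(z, d, iterations=5):
    """Convergence division: with d in (0.5, 1], repeatedly multiply
    both z and d by r = 2 - d, so d converges quadratically to 1
    and z converges to the quotient z/d."""
    q = z
    for _ in range(iterations):
        r = 2.0 - d          # next multiplicative factor
        q *= r               # dividend * r -> quotient estimate
        d *= r               # divisor * r -> approaches 1
    return q
```

Because the divisor's distance from 1 squares on every pass, a handful of iterations suffices: for z = 0.6, d = 0.75, five iterations give 0.8 to full double precision.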
The IBM System 360 Model 91
[Fig. A.1  Overall structure of the IBM System/360 Model 91 floating-point execution unit: a floating-point instruction unit (4 registers, 6 buffers, instruction buffers and controls) connects to storage and to the fixed-point unit, and feeds, over register, buffer, and common buses, two floating-point execution units: an add unit with two pipelined adder stages, and a multiply/divide unit with reservation stations (RS1, RS2, RS3), a multiply iteration unit, and a propagate adder; results return on the result bus]
Slide 97
A.3 Deeply Pipelined Vector Machines
Cray X-MP/Model 24 (multiple-processor vector machine)
Had multiple function units, each of which could produce a new result on every clock tick, given suitably long vectors to process
Clock cycle = 9.5 ns
Used 5 integer/logic function units and 3 floating-point function units
   Integer/logic units: add, shift, logical 1, logical 2, weight/parity
   Floating-point units: add (6 stages), multiply (7 stages), reciprocal approximation (14 stages)
Pipeline setup and shutdown overheads
Vector unit not efficient for short vectors (break-even point)
Pipeline chaining
Slide 98
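The break-even point mentioned above can be modeled with a back-of-the-envelope cycle count: a pipelined unit pays a fixed setup plus fill cost, then delivers one result per cycle. The overhead and latency numbers below are hypothetical placeholders, not the X-MP's actual figures:

```python
def vector_time(n, setup=8, stages=6):
    """Cycles to process an n-element vector on a pipelined unit:
    fixed setup/shutdown overhead plus pipeline fill, then one
    result per cycle thereafter."""
    return setup + stages + n

def scalar_time(n, per_element=6):
    """Unpipelined time: every element pays the full latency."""
    return per_element * n

def break_even_length(setup=8, stages=6, per_element=6):
    """Smallest vector length for which the pipelined unit wins."""
    n = 1
    while vector_time(n, setup, stages) >= scalar_time(n, per_element):
        n += 1
    return n
```

With these illustrative parameters the vector unit starts winning at length 3; long vectors amortize the overhead almost completely (114 vs. 600 cycles at n = 100), which is exactly why short vectors were the pipeline's weak spot.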
Cray X-MP Vector Computer
[Fig. A.2  The vector section of one of the processors in the Cray X-MP/Model 24 supercomputer]
Slide 99
A.4 The DSP Revolution
Special-purpose DSPs have used a wide variety of unconventional arithmetic methods; e.g., RNS or logarithmic number representation
General-purpose DSPs provide an instruction set that is tuned to the needs of arithmetic-intensive signal processing applications
Example DSP instructions:
   ADD  A, B        { A + B -> B }
   SUB  X, A        { A - X -> A }
   MPY  X1, X0, B   { X1 * X0 -> B }
   MAC  Y1, X1, A   { A + Y1 * X1 -> A }
   AND  X1, A       { A AND X1 -> A }
General-purpose DSPs come in integer and floating-point varieties
Slide 100
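The MAC pattern these instruction sets accelerate is the inner loop of FIR filtering. The plain-Python model below (illustrative only) shows what a DSP does with one MAC instruction per filter tap:

```python
def fir_filter(x, h):
    """FIR filtering as a chain of multiply-accumulate (MAC) steps:
    y[n] = sum over k of h[k] * x[n-k]."""
    y = []
    for n in range(len(x)):
        acc = 0                         # accumulator (register A on a DSP)
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one MAC instruction per tap
        y.append(acc)
    return y
```

On a DSP, the inner loop runs in one cycle per tap because the multiply, the add into the accumulator, and the operand fetches over the X and Y buses all happen in a single MAC instruction.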
Fixed-Point DSP Example
[Fig. A.3  Block diagram of the data ALU in Motorola's DSP56002 (fixed-point) processor: 24-bit X and Y buses load input registers X1, X0, Y1, Y0, which feed a 24 x 24 multiplier and a 56-bit accumulator, rounding, and logical unit with shifter; results land in accumulator registers A (A2, A1, A0) and B (B2, B1, B0) and leave through shifter/limiters with overflow indication]
Slide 101
Floating-Point DSP Example
[Fig. A.4  Block diagram of the data ALU in Motorola's DSP96002 (floating-point) processor: 32-bit X and Y buses and an I/O format converter feed a register file (10 96-bit, or 10 64-bit, or 30 32-bit registers) serving an add/subtract unit, a multiply unit, and a special function unit]
Slide 102
A.5 Supercomputers on Our Laps
In the beginning, there was the 8080; it led to the 80x86 = IA-32 ISA
80286, 80386, 80486, Pentium (80586): half a dozen or so pipeline stages
   (more advanced technology)
Pentium Pro, Pentium II, Pentium III, Celeron: a dozen or so pipeline stages, with out-of-order instruction execution
   (more advanced technology)
Pentium 4: two dozen or so pipeline stages; instructions are broken into micro-ops, which are executed out-of-order but retired in-order
Slide 103
Implementation Topics .6 / yr GIPS Pentium II R10000 Pentium 80486 80386 68040 MIPS 68000 80286 KIPS 1980 May 2010 1990 2000 2010 Slide 104 Calendar year Computer Arithmetic.Performance Trends in Intel Microprocessors TIPS Processor performance 1.
Arithmetic in the Intel Pentium Pro Microprocessor
[Fig. 28.5  Key parts of the CPU in the Intel Pentium Pro (P6) microprocessor: a reservation station dispatches over 80-bit paths to port 0 units (integer execution unit 0, shift, FLP multiply, FLP divide, integer divide, FLP add), port 1 units (integer execution unit 1, jump execution unit), and ports 2, 3, and 4, which are dedicated to memory access (address generation units, etc.); results retire through the reorder buffer and retirement register file]
Slide 105
A.6 Trends, Outlook, and Resources
Current focus areas in computer arithmetic:
Design: Shift of attention from algorithms to optimizations at the level of transistors and wires; this explains the proliferation of hybrid designs
Technology: Predominantly CMOS, with a phenomenal rate of improvement in size/speed; new technologies cannot compete
Applications: Shift from high-speed or high-throughput designs in mainframes to embedded systems requiring low cost and low power
Slide 106
Ongoing Debates and New Paradigms
Renewed interest in bit-serial and digit-serial arithmetic as mechanisms to reduce the VLSI area and to improve packageability and testability
Synchronous vs. asynchronous design (asynchrony has some overhead, but an equivalent overhead is being paid for clock distribution and/or systolization)
New design paradigms may alter the way in which we view or design arithmetic circuits:
   Neuron-like computational elements
   Optical computing (redundant representations)
   Multivalued logic (match to high-radix arithmetic)
   Configurable logic
   Arithmetic complexity theory
Slide 107
Computer Arithmetic Timeline Snapshot
[Fig. A.6  Computer arithmetic through the decades: key ideas, innovations, advancements, technology traits, and milestones. 1940s [Burk46]: binary format, carry chains, stored carry, carry-save multiplier, restoring divider. 1950s [Shaw50]: carry-lookahead adder, high-radix multiplier, SRT divider, CORDIC algorithms. 1960s [MacS61], [Ande67], [Thor70]: tree/array multiplier, signed-digit, high-radix & convergence dividers, floating point; ARITH-2/3/4 symposia. 1970s [Garn76]: pipelined arithmetic, vector supercomputer, microprocessor. 1980s [Swar90]: VLSI, digital signal processor, on-line arithmetic, table lookup, IEEE 754-1985. 1990s: CMOS dominance, deep pipeline, circuit-level optimization, hybrid design. 2000s [Swar09]: power/energy/heat reduction, media processing, embedded system, FPGA-based arithmetic, IEEE 754-2008. 2010s: teraflops on laptop (or pocket device?), asynchronous design, nanodevice arithmetic]
Slide 108
The End!
You're up to date. Take my advice and try to keep it that way. It'll be tough to do, make no mistake about it. The phone will ring and it'll be the administrator, talking about budgets. The doctors will come in, and they'll want this bit of information and that. Then you'll get the salesman. Until at the end of the day you'll wonder what happened to it and what you've accomplished, what you've achieved. That's the way the next day can go, and the next, and the one after that. Until you find a year has slipped by, and another, and another. And then suddenly, one day, you'll find everything you knew is out of date. That's when it's too late to change. Listen to an old man who's been through it all, who made the mistake of falling behind. Don't let it happen to you! Lock yourself in a closet if you have to! Get away from the phone and the files and paper, and read and learn and listen and keep up to date. Then they can never touch you, never say, "He's finished, all washed up; he belongs to yesterday."
Arthur Hailey, The Final Diagnosis
Slide 109