
Distributed Arithmetic

Dr Sumam David S.
Dept. of E&C, NITK Surathkal

Courtesy for slides – Xilinx Professor’s Workshop Resources


Objective

• Distributed arithmetic
• What?
• Where?
• How?
What is DA?

• Multiplication using look-up tables (LUTs)

• Used to implement multipliers in LUT-rich FPGAs
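Two's Complement Multiplication

• One bit at a time: each multiplier bit selects either 0 or the multiplicand, and the selected partial products are shifted and accumulated (the sign-bit partial product is subtracted)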
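
A minimal Python sketch of one-bit-at-a-time two's complement multiplication. The function names and the `width` parameter are illustrative assumptions, not taken from the slides; the sample is assumed to fit in `width` bits.

```python
def to_bits(value, width):
    """Return the two's complement bits of value, LSB first."""
    return [(value >> i) & 1 for i in range(width)]


def bit_serial_multiply(coeff, sample, width):
    """Multiply coeff by a width-bit two's complement sample, one bit per step."""
    acc = 0
    for i, bit in enumerate(to_bits(sample, width)):
        partial = coeff if bit else 0   # 1-bit "multiply": select coeff or 0
        if i == width - 1:
            acc -= partial << i         # sign bit has weight -2^(width-1)
        else:
            acc += partial << i         # data bits have weight +2^i
    return acc


print(bit_serial_multiply(-7, 7, 4))    # -49
print(bit_serial_multiply(6, -3, 4))    # -18
```

The only special case is the final step: the sign bit's partial product is subtracted instead of added.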


SDA 1-Tap FIR Filter

[Figure: N-bit-wide sample data X0 -> parallel-to-serial converter -> 1-bit address A0 -> partial product ROM -> scaling accumulator (+/-, Z^-1)]

A0   ROM contents
0    00000...0
1    C0

• LUT contains two locations


Distributed Arithmetic for a 2-Tap Filter

• Partial products of equal weight are added together before being summed to the next higher partial product weight
• Create a look-up table of summed partial products
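
The two points above can be written out as a short derivation. This is a sketch in assumed notation: K is the number of taps, B the sample width in bits, and x_{k,b} is bit b of sample X_k in two's complement; only C_k, X_k and B appear in the slides.

```latex
\begin{align}
y &= \sum_{k=0}^{K-1} C_k X_k,
\qquad
X_k = -x_{k,B-1}\,2^{B-1} + \sum_{b=0}^{B-2} x_{k,b}\,2^{b} \\
y &= -2^{B-1} F(B-1) + \sum_{b=0}^{B-2} 2^{b} F(b),
\qquad
F(b) = \sum_{k=0}^{K-1} C_k\, x_{k,b}
\end{align}
```

F(b) depends only on the K bits x_{0,b}, ..., x_{K-1,b}, so its 2^K possible values can be precomputed and stored in a look-up table. For K = 2 the entries are 0, C0, C1 and C0 + C1.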

Bit weights: -2^3  2^2  2^1  2^0

    C0 = 1 0 0 1  (-7)        C1 = 0 1 1 0  ( 6)
  x X0 = 0 1 1 1  ( 7)      x X1 = 0 1 0 1  ( 5)

Bit (X0, X1)   Partial sum            Weight   Weighted value
b0  (1, 1)     1001 + 0110 = 1111     2^0      (-1)
b1  (1, 0)     1001 + 0000 = 1001     2^1      (-14)
b2  (1, 1)     1001 + 0110 = 1111     2^2      (-4)
b3  (0, 0)     0000 + 0000 = 0000     -2^3     (0)

C0 x X0 + C1 x X1 = 11001111 (-49) + 00011110 (30) = 11101101 (-19)

Each partial sum is sign-extended before it is added at its weight (serial-data / tap-parallel multiply).
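
The worked example can be checked with a short Python sketch. The coefficient and sample values come from the slide; the function name `da_two_tap` and the dictionary-based look-up table are illustrative assumptions.

```python
def da_two_tap(c0, c1, x0, x1, width):
    """Return (weighted partial sums, result) for a 2-tap DA multiply-add."""
    # Look-up table addressed by one bit from each sample: (bit of X1, bit of X0).
    lut = {(0, 0): 0, (0, 1): c0, (1, 0): c1, (1, 1): c0 + c1}
    steps = []
    for b in range(width):
        a0 = (x0 >> b) & 1
        a1 = (x1 >> b) & 1
        term = lut[(a1, a0)] << b       # partial sum at weight 2^b
        if b == width - 1:
            term = -term                # sign-bit row has weight -2^(width-1)
        steps.append(term)
    return steps, sum(steps)


steps, result = da_two_tap(-7, 6, 7, 5, 4)
print(steps)    # [-1, -14, -4, 0]
print(result)   # -19
```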


SDA 2-Tap FIR Filter

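[Figure: N-bit-wide sample data X0 and X1, each serialized to 1 bit, form address bits A0 and A1 of the partial product ROM, which feeds the scaling accumulator (+/-, Z^-1)]

A1 A0   ROM contents
0  0    0000...0
0  1    C0
1  0    C1
1  1    C0 + C1

• LUT contains all possible sums of the partial products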
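
A sketch of how such a table could be generated for any number of taps, and then used bit-serially. The names `build_da_lut` and `sda_fir_output` are hypothetical. The model shifts each LUT output left by its bit position so the arithmetic stays in exact integers; a hardware scaling accumulator would typically shift the running sum instead.

```python
def build_da_lut(coeffs):
    """Entry `addr` holds the sum of coeffs[k] for every set bit k of addr."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def sda_fir_output(coeffs, samples, width):
    """Bit-serial DA: one LUT read and one accumulate per sample bit."""
    lut = build_da_lut(coeffs)
    acc = 0
    for bit in range(width):
        # Address bit k is the current bit of sample k (A0 from X0, A1 from X1).
        addr = sum(((x >> bit) & 1) << k for k, x in enumerate(samples))
        term = lut[addr] << bit
        # The sign bit has negative weight, hence the +/- on the accumulator.
        acc += -term if bit == width - 1 else term
    return acc


print(build_da_lut([-7, 6]))                # [0, -7, 6, -1]
print(sda_fir_output([-7, 6], [7, 5], 4))   # -19
```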
SDA 4-Tap FIR Filter
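[Figure: N-bit-wide sample data X0 to X3, each serialized to 1 bit, drive address bits A0 to A3. Each bit Ak selects 0000...0 or Ck; the four selected values are added (+) to form the partial product ROM output, which feeds the scaling accumulator (+/-, Z^-1)]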
SDA 8-Tap FIR Filter
[Figure: N-bit-wide sample data X0 to X7, each serialized to 1 bit. X0 to X3 address one partial product ROM (A0 to A3) and X4 to X7 address a second partial product ROM (A0 to A3). A pre-adder (+) sums the two ROM outputs and feeds the scaling accumulator (+/-, Z^-1)]

• Each 4-input LUT contains all possible sums of the partial products
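
A sketch of the 8-tap arrangement in Python, assuming the split shown in the figure: taps 0 to 3 share one 4-input LUT, taps 4 to 7 share a second, and a pre-adder combines the two outputs before accumulation. All function names are illustrative.

```python
def build_lut(coeffs):
    """All possible sums of the coefficients, indexed by the address bits."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def address(samples, bit):
    """Pack bit `bit` of each sample into a LUT address (sample k -> bit k)."""
    return sum(((x >> bit) & 1) << k for k, x in enumerate(samples))


def sda_fir_8tap(coeffs, samples, width):
    """8-tap serial DA built from two 4-input LUTs and a pre-adder."""
    lut_lo = build_lut(coeffs[:4])      # ROM addressed by X0..X3
    lut_hi = build_lut(coeffs[4:])      # ROM addressed by X4..X7
    acc = 0
    for bit in range(width):
        pre_add = (lut_lo[address(samples[:4], bit)]
                   + lut_hi[address(samples[4:], bit)])
        term = pre_add << bit
        acc += -term if bit == width - 1 else term   # subtract on the sign bit
    return acc


coeffs = [3, -1, 4, 1, -5, 9, 2, -6]
samples = [7, -8, 0, 5, -3, 2, -1, 6]   # 4-bit two's complement values
print(sda_fir_8tap(coeffs, samples, 4))
print(sum(c * x for c, x in zip(coeffs, samples)))   # direct dot product, same value
```

Splitting the table this way needs two 16-entry LUTs instead of one 256-entry LUT, at the cost of one extra adder.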
Xilinx DA FIR Performance
[Figure: two plots against Filter Length (Taps), 0 to 250. Left: Sample Rate (MSPS), 0 to 60. Right: Performance (MMACs/s), 0 to 6000. Curve labels: Single MAC, Dual MAC, DA FIR B=8, DA FIR B=12, DA FIR B=16, Serial FPGA FIR]

fclk = 200 MHz for both processor and FPGA
B = data sample precision for FPGA
Trade Clock Cycles for Logic Area
[Figure: multiple bits per clock cycle for an 8-bit sample (b7...b0), ranging from Serial-DA at 20 Ms/s to Parallel-DA at 160 Ms/s]

• Over-sampling = 8: the sample is serialized and processed 1 bit per clock cycle. 8 clock cycles are thus required to process the whole sample.
• Over-sampling = 4: the sample is serialized and processed 2 bits per clock cycle. 4 clock cycles are thus required to process the whole sample.
• Over-sampling = 2: the sample is serialized and processed 4 bits per clock cycle.
• Over-sampling = 1: the sample is processed in parallel, 8 bits per clock cycle.
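
The trade-off can be modelled with a small Python sketch. It assumes that processing `bits_per_clock` bits in one cycle corresponds to that many LUT reads happening in parallel; `da_fir` and its parameters are hypothetical names.

```python
def build_lut(coeffs):
    """All possible sums of the coefficients, indexed by the address bits."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_fir(coeffs, samples, width, bits_per_clock):
    """Return (result, clock_cycles) when bits_per_clock bits are handled per cycle."""
    if width % bits_per_clock:
        raise ValueError("width must be a multiple of bits_per_clock")
    lut = build_lut(coeffs)
    acc = 0
    cycles = 0
    for base in range(0, width, bits_per_clock):
        # One clock cycle: each bit in this group would use its own LUT copy.
        for bit in range(base, base + bits_per_clock):
            addr = sum(((x >> bit) & 1) << k for k, x in enumerate(samples))
            term = lut[addr] << bit
            acc += -term if bit == width - 1 else term
        cycles += 1
    return acc, cycles


coeffs = [3, -1, 4, 1]
samples = [100, -128, -1, 27]           # 8-bit two's complement values
for bpc in (1, 2, 4, 8):
    print(bpc, da_fir(coeffs, samples, 8, bpc))
```

The result is the same for every setting; only the cycle count (8, 4, 2, 1) and the number of parallel LUT reads change, which matches the over-sampling factors listed above.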
Conclusion

• Efficiency of computation
• Slow, as it is bit-serial
• Memory requirements
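
A rough sketch of the memory requirement, assuming one table entry per address and, for the partitioned case, that every partition is counted as a full LUT. The default of no partitioning and the 4-input option mirror the 2-tap and 8-tap slides; `lut_words` is a hypothetical helper.

```python
import math


def lut_words(taps, inputs_per_lut=None):
    """Number of table entries for a single LUT or for partitioned LUTs."""
    if inputs_per_lut is None:
        return 2 ** taps                            # one LUT addressed by all taps
    luts = math.ceil(taps / inputs_per_lut)         # number of smaller LUTs
    return luts * 2 ** inputs_per_lut


for taps in (1, 2, 4, 8, 16):
    print(taps, lut_words(taps), lut_words(taps, 4))
```

A single table grows exponentially with the number of taps, whereas partitioned tables grow linearly but need adders to combine their outputs.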
References

• The role of Distributed Arithmetic in FPGA-based signal processing, www.xilinx.com
