
New Approach to LUT Implementation and Accumulation for Memory-Based Multiplication


Pramod Kumar Meher
Communication Systems Department, Institute for Infocomm Research,
Agency for Science, Technology and Research (A*STAR), Singapore.

Abstract: A new approach to look-up-table (LUT) implementation for memory-based multiplication is presented, where the memory size is reduced to half at the cost of some increase in combinational circuit complexity. The proposed design offers a saving of nearly 42% in area and 38% in area-delay product (ADP) at the cost of a 6% increase in computational delay for memory-based multiplication of 8-bit inputs with a 16-bit coefficient. For high-precision multiplication, a shift-save-accumulation scheme is proposed to accumulate the LUT outputs corresponding to the segments of the input operand; it requires nearly 1.5 times more area, but offers more than twice the throughput and nearly two-thirds the ADP of the direct shift-accumulation approach.

I. INTRODUCTION

As device scaling has progressed over the last four decades, semiconductor memory has become cheaper, faster and more power-efficient. According to the projections of the International Technology Roadmap for Semiconductors (ITRS) [1], embedded memories will continue to have a dominating presence in the system-on-chip (SoC), and may exceed 90% of total SoC content. It has also been found that the transistor packing density of SRAM is not only high but also increasing much faster than the transistor density of logic devices. Moreover, memory-based structures are more regular than multiply-accumulate structures, and have many other advantages, e.g., greater potential for high-throughput and reduced-latency implementation, and less dynamic power consumption (due to less switching activity for memory-read operations compared to conventional multipliers).

Memory-based structures are well-suited for many DSP algorithms that involve multiplication with a fixed set of coefficients. Several architectures have been reported in the literature for memory-based implementation of DSP algorithms involving orthogonal transforms and digital filters [2]–[5]. But we do not find any work on efficient implementation of the look-up-table (LUT) for memory-based computation. In this paper, we present a design approach for reducing the LUT size for memory-based multiplication, which can be used efficiently for small input widths. Besides, we show that multiplications for large operand widths can be implemented by input decomposition using shift-save-accumulation.

II. PROPOSED LUT DESIGN FOR CONSTANT MULTIPLICATION

The principle of memory-based implementation of multiplication is shown in Fig. 1. Let $A$ be a fixed coefficient and $X$ be an input word to be multiplied with $A$. Assuming $X$ to be a positive binary number of word-length $L$, there can be $2^L$ possible values of $X$, and accordingly there can be $2^L$ possible values of the product $C = A \cdot X$. Therefore, for memory-based multiplication, an LUT of $2^L$ words, consisting of pre-computed product values corresponding to all possible values of $X$, is conventionally used. The product word $A \cdot X_i$ is stored at the location whose address is the same as $X_i$, for $0 \le X_i \le 2^L - 1$, such that if the $L$-bit binary value of $X_i$ is used as the address for the LUT, then the corresponding product value $A \cdot X_i$ is available as its output.

Although $2^L$ possible values of $X$ correspond to $2^L$ possible values of $C = A \cdot X$, only the $(2^L/2)$ words corresponding to the odd multiples of $A$ need to be stored in the LUT. Of the remaining product words, one is zero, while the other $(2^L/2) - 1$ are even multiples of $A$, which can be derived by left-shift operations on one of the odd multiples of $A$. We illustrate this in Table I for $L = 4$.

TABLE I
LUT WORDS AND PRODUCT VALUES FOR INPUT WORD LENGTH L = 4

address   word    stored  input        product     # of    control
d2 d1 d0  symbol  value   x3 x2 x1 x0  value       shifts  s1 s0
000       P0      A       0001         A           0       0 0
                          0010         2^1 × A     1       0 1
                          0100         2^2 × A     2       1 0
                          1000         2^3 × A     3       1 1
001       P1      3A      0011         3A          0       0 0
                          0110         2^1 × 3A    1       0 1
                          1100         2^2 × 3A    2       1 0
010       P2      5A      0101         5A          0       0 0
                          1010         2^1 × 5A    1       0 1
011       P3      7A      0111         7A          0       0 0
                          1110         2^1 × 7A    1       0 1
100       P4      9A      1001         9A          0       0 0
101       P5      11A     1011         11A         0       0 0
110       P6      13A     1101         13A         0       0 0
111       P7      15A     1111         15A         0       0 0

s0 and s1 are control bits of the logarithmic barrel-shifter.

Fig. 1. Conventional LUT-Based Multiplier. [Block diagram: the L-bit input X drives the address port of a memory core of 2^L words; the (W + L)-bit product AX appears at the output port.]
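The storage argument above can be checked with a few lines of software. The following is a minimal Python sketch, not the hardware design itself: the function names and the coefficient value 45 are hypothetical and chosen only for illustration. It compares a conventional table of $2^L$ products with a look-up that keeps only the odd multiples of $A$ and derives every other non-zero product by left-shifting.

```python
def conventional_lut(a: int, length: int) -> list[int]:
    """Conventional table of 2**length words: entry x holds a * x."""
    return [a * x for x in range(1 << length)]


def odd_multiples(a: int, length: int) -> dict[int, int]:
    """Reduced table: only the 2**length / 2 odd multiples of a are stored."""
    return {m: a * m for m in range(1, 1 << length, 2)}


def odd_part_and_shift(x: int) -> tuple[int, int]:
    """Split a non-zero x into (odd, shift) such that x == odd << shift."""
    shift = 0
    while x % 2 == 0:
        x >>= 1
        shift += 1
    return x, shift


def reduced_lookup(a: int, length: int, x: int) -> int:
    """Return a * x using only stored odd multiples plus a left shift."""
    if x == 0:
        return 0  # the zero product is not stored; the output is simply reset
    odd, shift = odd_part_and_shift(x)
    return odd_multiples(a, length)[odd] << shift


if __name__ == "__main__":
    A, L = 45, 4  # hypothetical coefficient and the word length of Table I
    full = conventional_lut(A, L)
    assert all(reduced_lookup(A, L, x) == full[x] for x in range(1 << L))
    print(len(full), "words versus", len(odd_multiples(A, L)), "words")
```

For $L = 4$ the reduced table holds 8 words instead of 16, and the largest shift needed is 3 (for $X = 8$), which is the case Table I lists for input 1000.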

978-1-4244-3828-0/09/$25.00 ©2009 IEEE


Authorized licensed use limited to: Gheorghe Asachi Technical University of Iași. Downloaded on March 29, 2024 at 14:05:09 UTC from IEEE Xplore. Restrictions apply.
Fig. 2. Proposed reduced-size LUT architecture for multiplication of a W-bit fixed coefficient A with a 4-bit input operand, X = x3 x2 x1 x0. (a) The proposed LUT-based multiplier. (b) The 4-to-3 bit input encoder. (c) Control circuit for generation of the control word for the two-stage logarithmic barrel-shifter. [Block diagram, panel (a): the input bits x3 x2 x1 x0 feed the 4-to-3 bit encoder and the control circuit; the address bits d2 d1 d0 drive a 3-to-8 line address decoder whose word-select lines w0 to w7 select a word of the 8 × (W + 4) memory array; the memory output passes through the AND cell (gated by RESET) and the barrel shifter (controlled by s0 and s1) to give the (W + 4)-bit output AX.]

For $L = 4$, eight odd multiples $A \times (2i + 1)$ are stored as $P_i$, for $i = 0, 1, 2, \ldots, 7$, at eight memory locations. The even multiples $2A$, $4A$ and $8A$ are derived by left-shift operations on $A$. Similarly, $6A$ and $12A$ are derived by left-shifting $3A$, while $10A$ and $14A$ are derived by left-shifting $5A$ and $7A$, respectively. The address $X = (0\,0\,0\,0)$ corresponds to $A \cdot X = 0$, which can be obtained by resetting the memory output. Similarly, for any word-size $L$, only the $(2^L/2)$ odd multiples need to be stored in the LUT, while the other $(2^L/2 - 1)$ non-zero values can be derived by left-shift operations on the stored values. Based on the above, an LUT for multiplication of an $L$-bit input with a $W$-bit coefficient can be designed by the following strategy:

• A memory unit of $(2^L/2)$ words of $(W + L)$-bit width is used to store the odd multiples of $A$.
• A barrel-shifter producing a maximum of $(L - 1)$ left-shifts is used to derive all the even multiples of $A$.
• The $L$-bit input word is mapped to an $(L - 1)$-bit address of the LUT by an encoder.
• The control bits for the barrel-shifter are derived by a control circuit to perform the necessary shifts of the LUT output. Besides, a RESET signal is generated by the same control circuit to reset the LUT output when $X = 0$.

The proposed LUT design for $L = 4$ is shown in Fig. 2. It consists of a memory array of eight words of $(W + 4)$-bit width and a 3-to-8 line address decoder, along with an AND cell, a barrel-shifter, a 4-to-3 bit encoder, and a control circuit. The 4-to-3 bit input encoder is shown in Fig. 2(b). It receives a four-bit input word $(x_3 x_2 x_1 x_0)$ and maps it onto the three-bit address word $(d_2 d_1 d_0)$, according to the relations:

$$d_0 = \overline{\overline{(x_0 \cdot x_1)} \cdot \overline{(x_1 \cdot x_2)} \cdot \left(x_0 + \overline{(x_2 \cdot x_3)}\right)} \tag{1a}$$

$$d_1 = \overline{\overline{(x_0 \cdot x_2)} \cdot \left(x_0 + \overline{(x_1 \cdot x_3)}\right)} \tag{1b}$$

$$d_2 = x_0 \cdot x_3 \tag{1c}$$

The pre-computed values of $A \times (2i + 1)$ are stored as $P_i$, for $i = 0, 1, 2, \ldots, 7$, at the 8 consecutive locations of the memory array as specified in Table I. The decoder takes the 3-bit address from the input encoder and generates 8 word-select signals, $\{w_i, \text{ for } 0 \le i \le 7\}$, to select the referenced word from the memory array. The output of the memory array is either $AX$ or a sub-multiple of $AX$, depending on the value of $X$. From Table I, we find that the LUT output must be shifted left by one location when the input operand $X$ is one of the values $\{(0\,0\,1\,0), (0\,1\,1\,0), (1\,0\,1\,0), (1\,1\,1\,0)\}$. Two left-shifts are required if $X$ is either $(0\,1\,0\,0)$ or $(1\,1\,0\,0)$. Three shifts are required only when the input word is $X = (1\,0\,0\,0)$. For all other possible input operands, no shifts are required. Since the maximum number of left-shifts required on the stored word is three, a two-stage logarithmic barrel-shifter is adequate to perform the necessary left-shift operations. The number of shifts to be performed on the output of the LUT and the control bits $s_0$ and $s_1$ for different values of $X$ are shown in Table I. The control circuit [shown in Fig. 2(c)] accordingly generates these control bits, given by

$$s_0 = \overline{x_0 + \overline{(x_1 + \overline{x_2})}} \tag{2a}$$

$$s_1 = \overline{(x_0 + x_1)} \tag{2b}$$

When the input operand word is $X = (0\,0\,0\,0)$, the output of the memory array is reset by the AND cell, consisting of $(W + 4)$ AND gates. The output bits of the memory array are fed in parallel as one of the inputs of the AND gates, while the RESET signal is fed as the other input of all the AND gates. When RESET = 0, all the AND gates produce the output value 0, while for RESET = 1, the LUT output is passed unchanged to the barrel-shifter. The reset could alternatively be implemented by a NOR cell consisting of $(W + 4)$ NOR gates, instead of the AND gates, by using an active-high RESET, where the product values $P_i$ for $i = 0, 1, \ldots, 7$ are stored in complement form. The control circuit generates an active-low RESET under the condition $X = (0\,0\,0\,0)$ according to the logic expression:

$$\text{RESET} = \overline{\overline{(x_0 + x_1)} \cdot \overline{(x_2 + x_3)}} \tag{3}$$

The proposed LUT design can also be used with a dual-port memory unit, where the input word-size can be doubled without increasing the memory space. A dual-port memory-based multiplier for 8-bit input is shown in Fig. 3.
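The datapath of Fig. 2(a) can be modelled bit by bit in software to confirm that encoder, control circuit, AND cell and barrel-shifter together reproduce every row of Table I. The sketch below is a behavioural Python model under stated assumptions: the Boolean expressions are written in sum-of-products form and checked against Table I rather than copied as a gate-level network, the function names are hypothetical, and the 16-bit coefficient 0xB3A5 is an arbitrary test value. The last function models the dual-port arrangement of Fig. 3 as two 4-bit look-ups combined by a shift-adder.

```python
def encoder(x3, x2, x1, x0):
    """4-to-3 bit address encoder (sum-of-products form, checked against Table I)."""
    n0 = 1 - x0                                   # complement of x0
    d0 = (x0 & x1) | (x1 & x2) | (n0 & x2 & x3)
    d1 = (x0 & x2) | (n0 & x1 & x3)
    d2 = x0 & x3
    return d2, d1, d0


def control(x3, x2, x1, x0):
    """Barrel-shifter control bits (s1, s0) and the active-low RESET."""
    n0, n1, n2 = 1 - x0, 1 - x1, 1 - x2
    s0 = n0 & (x1 | n2)
    s1 = n0 & n1
    reset = x0 | x1 | x2 | x3                     # 0 only when X = 0000
    return s1, s0, reset


def multiply4(a, x):
    """A * X for a 4-bit X using the 8 stored odd multiples of A."""
    memory = [a * (2 * i + 1) for i in range(8)]  # P0..P7 = A, 3A, ..., 15A
    x3, x2, x1, x0 = (x >> 3) & 1, (x >> 2) & 1, (x >> 1) & 1, x & 1
    d2, d1, d0 = encoder(x3, x2, x1, x0)
    s1, s0, reset = control(x3, x2, x1, x0)
    word = memory[4 * d2 + 2 * d1 + d0]           # address decoder + memory read
    word = word if reset else 0                   # AND cell clears the word for X = 0
    word <<= s0                                   # shifter stage 1: shift by 0 or 1
    word <<= 2 * s1                               # shifter stage 2: shift by 0 or 2
    return word


def multiply8(a, x):
    """A * X for an 8-bit X: two 4-bit look-ups combined by a shift-adder."""
    return (multiply4(a, x >> 4) << 4) + multiply4(a, x & 0xF)


if __name__ == "__main__":
    A = 0xB3A5  # hypothetical 16-bit coefficient
    assert all(multiply4(A, x) == A * x for x in range(16))
    assert all(multiply8(A, x) == A * x for x in range(256))
    print("all 4-bit and 8-bit products match")
```

Splitting the shift into a stage of 0 or 1 and a stage of 0 or 2 mirrors the two-stage logarithmic barrel-shifter: the pair $(s_1, s_0)$ is simply the binary code of the shift count listed in Table I.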

Fig. 3. LUT-based multiplier using dual-port memory-array. Q = (W + 4). [Block diagram: the input bits x03 x02 x01 x00 and x13 x12 x11 x10 each feed a 4-to-3 bit encoder and control circuit; each encoder drives a 3-to-8 line address decoder (port 1 and port 2) of a shared dual-port 8 × (W + 4) memory array; each port output passes through an AND cell and a barrel-shifter, and a shift-adder combines the two Q-bit words into the (W + 8)-bit output AX.]

Fig. 4. Memory-based multiplier using shift-accumulation. [Block diagram: an input loading unit with registers R1, R2, ..., R(P/2), each 2S bits wide; a dual-port memory unit of (2^S/2) words with shift-adder; an accumulator built from a T-bit adder and a T-bit register; and an output unit containing the output register array.]

III. HIGH-PRECISION MEMORY-BASED MULTIPLICATION BY INPUT-OPERAND DECOMPOSITION

The LUT-based multiplier discussed in Section II cannot be used when the width of the input $X$ is large, since the memory size increases exponentially with the operand width. Therefore, when the width of the input multiplicand $X$ is large, let it be decomposed into $P/2$ sub-words (segments) of $2S$ bits each. Without loss of generality, we can assume $L = PS$, since the input word can be zero-padded to make its length an even multiple of $S$ ($S$ is assumed to be $\le 4$). We discuss here two designs for pipelined memory-based multiplication by such input decomposition. The first is based on direct shift-accumulation of successive memory outputs, while for the other we propose a shift-save-accumulation approach.

A. Memory-Based Multiplier using Shift-Accumulator

The memory-based multiplier using a shift-accumulator (shown in Fig. 4) consists of (i) an input loading unit, (ii) a memory unit, and (iii) a shift-accumulate-and-output unit.

1) The Input Loading Unit: It consists of a parallel-in parallel-out array of $P/2$ registers of $2S$ bits each. The $2S$-bit segments of the input $X$ are loaded into the $P/2$ registers in parallel, such that the least significant segment is loaded into register $R_1$ and the successive registers are loaded with the successively more significant segments. The contents of the registers move across the input loading unit such that, during each cycle, the content of the $(i + 1)$-th register is transferred to the $i$-th register in parallel. During each cycle, the first register $R_1$ feeds a $2S$-bit sub-word to the pair of ports of the dual-port memory unit as address bits.

2) The Memory Unit: It contains a dual-port memory array, with a pair of $S$-bit ports, consisting of $(2^S/2)$ words (as shown in Fig. 3). It uses a shift-adder to left-shift the more significant LUT output by $S$ bit locations, and then to add that to the other LUT output to produce an output of $T = (W + 2S)$-bit size ($W$ is the width of the fixed coefficient).

3) Shift-Accumulator-and-Output Unit: The shift-accumulator accumulates the successive LUT outputs, whose numerical weights successively increase by a factor of $2^{2S}$. It consists of a $T$-bit adder followed by a $T$-bit delay register. The D flip-flops of this register are reset for initialization, and in each cycle thereafter are loaded with the output of the adder. During each cycle the adder receives a $T$-bit word from the LUT and the most significant $W$ bits from the delay register as operands. The least significant $2S$ bits of the delay register are transferred to an array of $(P/2 - 1)$ output registers of $2S$ bits each. After $P/2$ cycles the output register array holds the $2S(P/2 - 1)$ least significant bits of the product word, while the delay register contains the most significant $T$ bits of the product. The critical path of this multiplier is $T_A + T_D$, where $T_A$ and $T_D$ are the $T$-bit addition time and the D flip-flop delay, respectively. The latency for multiplication of two $L$-bit words is $L(T_A + T_D)/2S$.

B. Memory-Based Multiplier using Shift-Save-Accumulator

The memory-based multiplier using the proposed shift-save-accumulator (SSA) is shown in Fig. 5. It consists of five units: (i) an input loading unit, (ii) a memory unit, (iii) a pair of shift-save-accumulators, (iv) a pair of shift-adders and (v) an output shift-adder. The input loading unit is similar to that shown in Fig. 4, except that it has two arrays of $S$-bit registers, where the upper array loads the more significant half and the lower array loads the less significant half of $X$.

The memory unit is a dual-port memory of $(2^S/2)$ words, similar to that shown in Fig. 3. The structure of the shift-save-accumulator for $S = 2$ and $L = W = 8$ (for accumulation of four words) is shown in Fig. 5(b).
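The direct shift-accumulation scheme of Section III-A can be sketched in software as follows. This is a behavioural Python model under stated assumptions, not the register-level design: the function name is hypothetical, the dual-port memory and its shift-adder are replaced by a plain multiplication of $A$ with each $2S$-bit segment, and the operand values in the usage lines are arbitrary. It follows the description above: the accumulator adds each new LUT word to its own content shifted right by $2S$ bits, and the $2S$ bits shifted out go to the output register array.

```python
def shift_accumulate_multiply(a: int, x: int, length: int, s: int, w: int) -> int:
    """Multiply a W-bit coefficient a by an L-bit x, one 2S-bit segment per cycle."""
    seg_bits = 2 * s
    if length % seg_bits != 0:
        raise ValueError("zero-pad x so that its length is a multiple of 2S")
    cycles = length // seg_bits           # P/2 segments
    mask = (1 << seg_bits) - 1
    t_bits = w + seg_bits                 # T = W + 2S
    acc = 0                               # T-bit delay register, reset at start
    low_words = []                        # output register array, (P/2 - 1) words
    for cycle in range(cycles):
        segment = (x >> (cycle * seg_bits)) & mask   # content of R1 in this cycle
        lut_out = a * segment             # stands in for memory read + shift-adder
        if cycle > 0:
            low_words.append(acc & mask)  # least significant 2S bits leave the register
        acc = lut_out + (acc >> seg_bits) # adder: LUT word + top W bits of register
        assert acc < (1 << t_bits)        # the sum always fits in T bits
    product = acc                         # most significant T bits of the product
    for word in reversed(low_words):
        product = (product << seg_bits) | word
    return product


if __name__ == "__main__":
    A, X = 0xB3A5, 0xC7E9                 # hypothetical 16-bit operands
    assert shift_accumulate_multiply(A, X, length=16, s=4, w=16) == A * X
    print(hex(shift_accumulate_multiply(A, X, length=16, s=4, w=16)))
```

With $L = 16$ and $S = 4$ the loop runs $P/2 = 2$ times, which matches the latency expression $L(T_A + T_D)/2S$: one $T$-bit addition and one register update per $2S$-bit segment.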

Fig. 5. Memory-based multiplier using the proposed shift-save-accumulator. (a) The multiplier. (b) The shift-save-accumulator for S = 2 and W = L = 8. A, H and D stand for a full-adder, a half-adder and a D flip-flop, respectively. [Block diagram, panel (a): an input loading unit with two arrays of S-bit registers R1, R2, ..., R(P/2); a dual-port memory unit of (2^S/2) words with (W + S)-bit outputs; two shift-save accumulators, each producing a word of sums and a word of carries; shift-adder-1 and shift-adder-2 with (W + L/2)-bit outputs; and an output shift-adder producing the 2L-bit output word. Panel (b): the ROM output bits y9 to y0 feed a row of half-adders and full-adders, each followed by D flip-flops holding the shifted sums and carries.]

The shift-save-accumulator is a modified form of carry-save-accumulator in which the carries and the sums of the full-adders (A) and half-adders (H) are shifted towards the right by 1 and 2 locations, respectively, such that all the input bits arriving at a full-adder or a half-adder are of the same weight. The two MSBs do not need any adder, since they do not receive any sum or carry from any other location. The input bit two locations to the right of the MSB is fed to a half-adder, since it is added with only one more bit. The remaining 7 input bits are fed to seven full-adders. Since the SSA is designed to add four words, it must shift the partial results three times, and therefore requires six more half-adders for the additions of the shifted-out sums and carries. It requires 30 D flip-flops to store the 14 bits of carries and 16 bits of sums. After 4 shift-accumulations, each SSA produces two words: one constituted by all the sums and the other by all the carries of the half-adders and full-adders. The pair of outputs of each SSA is shift-added by a 14-bit adder, where the word of carries is left-shifted by one location before being added to the word of sums. The outputs of the pair of shift-adders are finally shift-added, where the less significant word is right-shifted by $(L/2)$ locations.

For multiplication of two $L$-bit words, each of the SSAs in general involves $(L - 1)$ full-adders, $(L/2 - S + 1)$ half-adders and $(3L - S)$ D flip-flops to store the sums and carries. Each of shift-adder-1 and shift-adder-2 needs a $(3L/2 - S)$-bit adder, and the output shift-adder needs a $(3L/2)$-bit adder. The critical path of the SSA is $T_M + T_{FA} + T_D$, where $T_{FA}$ and $T_D$ are the worst-case propagation delays of a full-adder and a D flip-flop, respectively, and $T_M$ is the memory-read time. The minimum clock period of the multiplier is $\max\{(L/2S)(T_M + T_{FA} + T_D),\ T_A\}$, where $T_A$ is the $(3L/2)$-bit addition time. Since the output shift-adder and shift-adders 1 and 2 work in two separate pipeline stages, an additional $3L$ latches are required before the output shift-adder. The multiplier has a throughput of one multiplication per clock after a latency of 3 cycles.

IV. RESULTS AND DISCUSSION

The area of the dual-port memories and adders, along with their data arrival times, are determined by using the Synopsys DesignWare 0.18 micron TSMC library [6]. The area and maximum propagation delays of MUXes and gates for unit drive-strength are obtained from the TSMC 0.18 µm process 1.8-Volt SAGE-X™ standard cell library data [7]. The area and delay complexities of multipliers based on the proposed reduced-LUT design and the conventional design are estimated accordingly (Table II), and it is found that the proposed design involves nearly 6% more delay, but offers a saving of nearly 42% in area and 38% in area-delay product (ADP) for $L = 8$ and $W = 16$. The area and time complexities of the multipliers for $L = W = 16$ and $S = 4$ using the proposed SSA scheme and direct shift-accumulation are listed in Table III. The SSA scheme is found to require about 1.5 times more area, but offers more than twice the throughput and nearly two-thirds the ADP of the direct shift-accumulation approach. The proposed LUT design is expected to be useful for hardware implementation of DSP algorithms.

TABLE II
COMPLEXITIES OF LUT-BASED MULTIPLIERS FOR L = 8 AND W = 16.

LUT design                                   area (µm²)  delay (ns)  ADP
conventional design using dual-port memory   41303.2     5.16        213124.7
proposed design using dual-port memory       23975.7     5.48        131279.3

Area is in µm². Delay = multiplication-time = 1 clock-period. ADP: area-delay product.

TABLE III
COMPLEXITIES OF MULTIPLIERS FOR L = W = 16 AND S = 4.

implementation scheme                        area (µm²)  delay (ns)  ADP
shift-accumulation using proposed LUT        25741.3     10.04       258561.0
shift-save accumulation using proposed LUT   38026.5     4.66        177203.4

REFERENCES

[1] International Technology Roadmap for Semiconductors. [Online]. Available: http://public.itrs.net/
[2] H.-R. Lee, C.-W. Jen, and C.-M. Liu, “On the design automation of the memory-based VLSI architectures for FIR filters,” IEEE Trans. Consumer Electronics, vol. 39, no. 3, pp. 619–629, Aug. 1993.
[3] H.-C. Chen, J.-I. Guo, T.-S. Chang, and C.-W. Jen, “A memory-efficient realization of cyclic convolution and its application to discrete cosine transform,” IEEE Trans. Circuits Syst. for Video Technol., vol. 15, no. 3, pp. 445–453, Mar. 2005.
[4] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, “Systolic algorithms and a memory-based design approach for a unified architecture for the computation of DCT/DST/IDCT/IDST,” IEEE Trans. Circuits Syst.-I: Regular Papers, vol. 52, no. 6, pp. 1125–1137, June 2005.
[5] P. K. Meher, “Systolic designs for DCT using a low-complexity concurrent convolutional formulation,” IEEE Trans. Circuits & Systems for Video Technology, vol. 16, no. 9, pp. 1041–1050, Sept. 2006.
[6] Synopsys, DesignWare Foundry Libraries, Mountain View, CA. [Online]. Available: http://www.synopsys.com/
[7] TSMC 0.18µm Process 1.8-Volt SAGE-X™ Standard Cell Library Databook, Release 4.1. Sunnyvale, CA: Artisan Components, Sept. 2003.
