Abstract: A new approach to look-up-table (LUT) implementation for memory-based multiplication is presented, where the memory size is reduced to half at the cost of some increase in combinational circuit complexity. The proposed design offers a saving of nearly 42% in area and 38% in area-delay product (ADP) at the cost of a 6% increase in computational delay for memory-based multiplication of 8-bit inputs with a 16-bit coefficient. For high-precision multiplication, a shift-save-accumulation scheme is proposed to accumulate the LUT outputs corresponding to the segments of the input operand, which requires nearly 1.5 times more area, but offers more than twice the throughput and nearly two-thirds the ADP of the direct shift-accumulation approach.

I. INTRODUCTION

As device scaling has progressed over the last four decades, semiconductor memory has become cheaper, faster and more power-efficient. According to the projections of the international technology roadmap for semiconductors (ITRS) [1], embedded memories will continue to have a dominating presence in the system-on-chip (SoC), which may exceed 90% of total SoC content. It has also been found that the transistor packing density of SRAM is not only high but also increasing much faster than the transistor density of logic devices. Moreover, memory-based structures are more regular than multiply-accumulate structures, and have many other advantages, e.g., greater potential for high-throughput and reduced-latency implementation and less dynamic power consumption (due to less switching activity for memory read operations compared to conventional multipliers).

Memory-based structures are well-suited for many DSP algorithms, which involve multiplication with a fixed set of coefficients. Several architectures have been reported in the literature for memory-based implementation of DSP algorithms involving orthogonal transforms and digital filters [2]–[5]. However, we do not find any work on efficient implementation of the look-up-table (LUT) for memory-based computation. In this paper, we aim at presenting a design approach for reducing the LUT size for memory-based multiplication, which could be used efficiently for small input widths. Besides, we show that multiplications for large operand widths could be implemented by input decomposition using shift-save-accumulation.

TABLE I
LUT WORDS AND PRODUCT VALUES FOR INPUT WORD LENGTH L = 4

address    word     stored   input          product      # of     control
d2 d1 d0   symbol   value    x3 x2 x1 x0    value        shifts   s1 s0
-------------------------------------------------------------------------
0 0 0      P0       A        0 0 0 1        A            0        0 0
                             0 0 1 0        2^1 × A      1        0 1
                             0 1 0 0        2^2 × A      2        1 0
                             1 0 0 0        2^3 × A      3        1 1
0 0 1      P1       3A       0 0 1 1        3A           0        0 0
                             0 1 1 0        2^1 × 3A     1        0 1
                             1 1 0 0        2^2 × 3A     2        1 0
0 1 0      P2       5A       0 1 0 1        5A           0        0 0
                             1 0 1 0        2^1 × 5A     1        0 1
0 1 1      P3       7A       0 1 1 1        7A           0        0 0
                             1 1 1 0        2^1 × 7A     1        0 1
1 0 0      P4       9A       1 0 0 1        9A           0        0 0
1 0 1      P5       11A      1 0 1 1        11A          0        0 0
1 1 0      P6       13A      1 1 0 1        13A          0        0 0
1 1 1      P7       15A      1 1 1 1        15A          0        0 0

s0 and s1 are control bits of the logarithmic barrel-shifter.

X be an input word to be multiplied with A. Assuming X to be a positive binary number with word-length L, there can be 2^L possible values of X, and accordingly, there can be 2^L possible values of the product C = A · X. Therefore, for memory-based multiplication, an LUT of 2^L words consisting of pre-computed product values corresponding to all possible values of X is conventionally used. The product word A · X_i is stored at the location whose address is the same as X_i for 0 ≤ i ≤ 2^L − 1, such that if the L-bit binary value of X_i is used as the address for the LUT, then the corresponding product value A · X_i is available as its output.

Although 2^L possible values of X correspond to 2^L possible values of C = A · X, only the (2^L/2) words corresponding to the odd multiples of A need to be stored in the LUT. One of the remaining product words is zero, while the other (2^L/2) − 1 are even multiples of A, which could be derived by left-shift operations on one of the odd multiples of A. We illustrate this in Table I
for L = 4. At eight memory locations, the eight odd multiples A × (2i + 1) are stored as P_i for i = 0, 1, 2, ..., 7. The even multiples 2A, 4A and 8A are derived by left-shift operations on A. Similarly, 6A and 12A are derived by left-shifting 3A, while 10A and 14A are derived by left-shifting 5A and 7A, respectively. The address X = (0 0 0 0) corresponds to A · X = 0, which can be obtained by resetting the memory output. Similarly, for any value of the word-size L, only the (2^L/2) odd multiples need to be stored in the LUT, while the other (2^L/2 − 1) non-zero values could be derived by left-shift operations on the stored values. Based on the above, an LUT for multiplication of an L-bit input with a W-bit coefficient could be designed by the following strategy:

• A memory unit of (2^L/2) words of (W + L)-bit width is used to store the odd multiples of A.
• A barrel-shifter for producing a maximum of (L − 1) left-shifts is used to derive all the even multiples of A.
• The L-bit input word is mapped to the (L − 1)-bit address of the LUT by an encoder.
• The control bits for the barrel-shifter are derived by a control circuit to perform the necessary shifts of the LUT output. Besides, a RESET signal is generated by the same control circuit to reset the LUT output when X = 0.

[Figure residue. Surviving labels from the block diagrams: an LUT of 2^L words with address X and output AX; and, for Fig. 2, the encoder, address decoder, memory array, AND cell, barrel shifter and control circuit, with signals x3, d2, d1, w5 to w7, s0, s1, RESET, and a (W+4)-bit output AX.]

The proposed LUT design for L = 4 is shown in Fig. 2. It consists of a memory array of eight words of (W + 4)-bit width and a 3-to-8 line address decoder, along with an AND cell, a barrel-shifter, a 4-to-3 bit encoder, and a control circuit. The 4-to-3 bit input encoder is shown in Fig. 2(b). It receives a four-bit input word (x3 x2 x1 x0) and maps it onto the three-bit address word (d2 d1 d0), according to the relations:

\[ d_0 = \overline{\overline{(x_0 \cdot x_1)} \cdot \overline{(x_1 \cdot x_2)} \cdot \left(x_0 + \overline{(x_2 \cdot x_3)}\right)} \tag{1a} \]
\[ d_1 = \overline{\overline{(x_0 \cdot x_2)} \cdot \left(x_0 + \overline{(x_1 \cdot x_3)}\right)} \tag{1b} \]
\[ d_2 = x_0 \cdot x_3 \tag{1c} \]

For example, X = (0 1 1 0) gives (d2 d1 d0) = (0 0 1), the address of 3A in Table I.

The pre-computed values of A × (2i + 1) are stored as P_i for i = 0, 1, 2, ..., 7 at the 8 consecutive locations of the memory array as specified in Table I. The decoder takes the 3-bit address from the input encoder and generates 8 word-select signals, {w_i, for 0 ≤ i ≤ 7}, to select the referenced word from the memory array. The output of the memory array is either AX or a sub-multiple of AX, depending on the value of X. From Table I, we find that the LUT output is required to be shifted through 1 location to the left when the input operand X is one of the values {(0 0 1 0), (0 1 1 0), (1 0 1 0), (1 1 1 0)}. Two left-shifts are required if X is either (0 1 0 0) or (1 1 0 0). Three shifts are required only when the input word X = (1 0 0 0). For all other possible input operands, no shifts are required. Since the maximum number of left-shifts required on the stored word is three, a two-stage logarithmic barrel-shifter is adequate to perform the necessary left-shift operations. The number of shifts required to be performed on the output of the LUT and the control bits s0 and s1 for different values of X are shown in Table I. The control circuit [shown in Fig. 2(c)] accordingly generates these control bits, given by

\[ s_0 = \overline{x_0 + \overline{(x_1 + \overline{x_2})}} \tag{2a} \]
\[ s_1 = \overline{(x_0 + x_1)} \tag{2b} \]

When the input operand word X = (0 0 0 0), the output of the memory array is reset by the AND cell, consisting of (W + 4) AND gates. The output bits of the memory array are fed as one of the inputs of the AND gates in parallel, while the RESET signal is fed as the other input of all the AND gates. When RESET = 0, all the AND gates produce output value 0, while for RESET = 1, the LUT output is passed unchanged to the barrel-shifter. The reset could alternatively be implemented by a NOR cell consisting of (W + 4) NOR gates, instead of the AND gates, by using an active-high RESET, where the product values P_i for i = 0, 1, ..., 7 are stored in complement form. The control circuit generates an active-low RESET under the condition X = (0 0 0 0) according to the logic expression:

\[ \mathrm{RESET} = \overline{\overline{(x_0 + x_1)} \cdot \overline{(x_2 + x_3)}} \tag{3} \]

The proposed LUT design can also be used with a dual-port memory unit, where the input word-size could be doubled without increasing the memory space. A dual-port memory-based multiplier for 8-bit input is shown in Fig. 3.
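The behaviour described above can be checked with a small bit-level model. The following is a minimal sketch, not the authors' circuit: the function names (`build_lut`, `encode`, `control`, `lut_multiply`, `multiply_8bit`) are hypothetical, the logic expressions follow Eqs. (1)-(3) with complement bars placed so that they reproduce Table I, and the way the two port outputs are combined in `multiply_8bit` (a 4-bit shift followed by an addition) is an assumption, since Fig. 3 itself is not reproduced here.

```python
def inv(bit: int) -> int:
    """Logical complement of a single bit."""
    return bit ^ 1


def bits(x: int) -> tuple[int, int, int, int]:
    """Split a 4-bit word into (x0, x1, x2, x3), x0 being the LSB."""
    return x & 1, (x >> 1) & 1, (x >> 2) & 1, (x >> 3) & 1


def build_lut(a: int) -> list[int]:
    """Store only the eight odd multiples P_i = A * (2i + 1), i = 0..7."""
    return [a * (2 * i + 1) for i in range(8)]


def encode(x: int) -> tuple[int, int, int]:
    """4-to-3 bit encoder: address bits (d2, d1, d0), as in Eqs. (1a)-(1c)."""
    x0, x1, x2, x3 = bits(x)
    d0 = inv(inv(x0 & x1) & inv(x1 & x2) & (x0 | inv(x2 & x3)))
    d1 = inv(inv(x0 & x2) & (x0 | inv(x1 & x3)))
    d2 = x0 & x3
    return d2, d1, d0


def control(x: int) -> tuple[int, int, int]:
    """Control circuit: (s1, s0, RESET), with RESET active-low, Eqs. (2)-(3)."""
    x0, x1, x2, x3 = bits(x)
    s0 = inv(x0 | inv(x1 | inv(x2)))
    s1 = inv(x0 | x1)
    reset = inv(inv(x0 | x1) & inv(x2 | x3))
    return s1, s0, reset


def lut_multiply(lut: list[int], x: int) -> int:
    """Product A * X for a 4-bit X using the 8-word LUT of odd multiples."""
    d2, d1, d0 = encode(x)
    s1, s0, reset = control(x)
    word = lut[(d2 << 2) | (d1 << 1) | d0]  # address decoder + memory read
    if reset == 0:                          # AND cell forces the output to 0
        word = 0
    if s0:                                  # barrel-shifter stage 1: shift by 1
        word <<= 1
    if s1:                                  # barrel-shifter stage 2: shift by 2
        word <<= 2
    return word


def multiply_8bit(lut: list[int], x: int) -> int:
    """Assumed dual-port use: two reads of the same LUT, one per 4-bit half."""
    low, high = x & 0xF, (x >> 4) & 0xF
    return lut_multiply(lut, low) + (lut_multiply(lut, high) << 4)
```

For instance, `lut_multiply(build_lut(5), 12)` reads the stored word 3A = 15 at address (0 0 1) and shifts it left twice, giving 60.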
Authorized licensed use limited to: Gheorghe Asachi Technical University of Iași. Downloaded on March 29, 2024 at 14:05:09 UTC from IEEE Xplore. Restrictions apply.
[Fig. 3: dual-port memory-based multiplier for 8-bit input. Surviving labels: input bits x03 to x00 and x12 to x10; dual-port memory of 8 × (W+4) bits; two 3-to-8 line address decoders with word-select lines w0 to w7; address bits d2, d1, d0; control bits s0, s1; RESET; AND cells; BARREL SHIFTER-1 and BARREL SHIFTER-2.]

[Labels from a further figure whose caption did not survive: registers R1, R2, ..., RP/2 with 2S-bit connections; an accumulator with a T-bit adder and a T-bit register; an output unit.]

... registers move across the input loading unit such that during ... the i-th register in parallel. During each cycle, the first register ...
Fig. 5. Memory-based multiplier using the proposed shift-save-accumulator. (a) The multiplier. (b) The shift-save-accumulator for S = 2 and W = L = 8. A, H and D stand for a full-adder, half-adder and a D flip-flop, respectively.
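The general idea of accumulating segment-wise LUT outputs in carry-save form can be illustrated with a short arithmetic model. This is a sketch under stated assumptions and is not the circuit of Fig. 5(b): the function names are hypothetical, the S-bit segments are processed most significant first with left shifts (the actual design may order and shift them differently), the products A · X_k are computed directly as a stand-in for LUT reads, and L is assumed to be a multiple of S.

```python
def carry_save_add(s: int, c: int, p: int) -> tuple[int, int]:
    """3:2 compression: returns (sum, carry) with sum + carry == s + c + p."""
    new_sum = s ^ c ^ p                           # bitwise full-adder sum outputs
    new_carry = ((s & c) | (s & p) | (c & p)) << 1  # carry outputs, one place up
    return new_sum, new_carry


def segment_products(a: int, x: int, length: int, seg: int) -> list[int]:
    """Products A * X_k for the S-bit segments X_k of X, most significant first."""
    mask = (1 << seg) - 1
    return [a * ((x >> k) & mask) for k in range(length - seg, -1, -seg)]


def shift_save_multiply(a: int, x: int, length: int = 8, seg: int = 2) -> int:
    """Accumulate segment products in carry-save form; one final addition."""
    s = c = 0
    for p in segment_products(a, x, length, seg):
        # Shift the saved sum and carry words by S bits, then add the new
        # segment product without propagating carries.
        s, c = carry_save_add(s << seg, c << seg, p)
    return s + c  # the only carry-propagate addition
```

The point of the carry-save form is that each accumulation step involves only one full-adder delay, with carry propagation deferred to a single final addition.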