
ARTICLE IN PRESS

INTEGRATION, the VLSI journal 40 (2007) 74–93


www.elsevier.com/locate/vlsi

Architecture design of a coarse-grain reconfigurable multiply-accumulate unit for data-intensive applications
K. Tatas, G. Koutroumpezis, D. Soudris, A. Thanailakis
VLSI Design and Testing Center, Department of Electrical and Computer Engineering, Democritus University of Thrace, 12 Vas. Sofias Str.,
67100, Xanthi, Greece

Abstract

A run-time reconfigurable multiply-accumulate (MAC) architecture is introduced. It can be easily reconfigured to trade bitwidth for
array size (thus maximizing the utilization of available hardware); process signed-magnitude, unsigned or 2’s complement data; make use
of part of its structure or adapt its structure based on the specified throughput requirements and the anticipated computational load. The
proposed architecture consists of a reconfigurable multiplier, a reconfigurable adder, an accumulation unit, and two units for data
representation conversion and incoming and outgoing data stream transfer. Reconfiguration can be done dynamically by using only a
few control bits and the main component modules can operate independently from each other. Therefore, they can be enabled or disabled
according to the required function each time. Comparison results in terms of performance, area and power consumption prove the
superiority of the proposed reconfigurable module over existing realizations in a quantitative and qualitative manner.
© 2006 Elsevier B.V. All rights reserved.

Keywords: MAC; Array multiplier; Carry-save adder; Reconfigurable architecture

This work was partially supported by the project AMDREL IST-2001-34379 funded by EC.
Corresponding author: K. Tatas. Tel.: +30 25410 79961; fax: +30 25410 79545. E-mail address: ktatas@ee.duth.gr.
0167-9260/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.vlsi.2006.02.011

1. Introduction

Reconfigurable hardware has for years been synonymous with FPGAs. These devices offer bit-level (fine-grain) reconfigurability and have recently become viable alternatives to ASICs for a number of applications. Still, the accelerating need for high performance, due to complex and real-time applications (for instance, a large number of new wireless communication standards and multimedia algorithms), combined with tight time-to-market constraints and the need for high flexibility due to evolving standards, has led to the emergence of another type of reconfigurable architecture: coarse-grain reconfiguration, which has proven a viable solution to such problems, leading to the beginning of reconfigurable computing. Reconfigurable computing has flourished recently due to many efforts from both academia and industry such as Garp [1], MATRIX [2], REMARC [3], MorphoSys [4], Pleiades [5], PipeRench [6], KressArray [7]. These implementations are coarse-grain reconfigurable architectures and tackle mainly Digital Signal Processing (DSP)/multimedia issues. Therefore, the design of efficient coarse-grain reconfigurable modules is critical for realizing modern applications. For that purpose, reconfigurable modules have been designed either for the above platforms or for non-specific platforms [8].

In this paper, a reconfigurable multiply-accumulate unit (MAC) is introduced and its architecture design presented in detail. It resolves the design conflict between versatility, area, and computation speed, and makes it possible to build a feasible and highly flexible processor with multiple multipliers and adders for data-intensive applications. It consists of a reconfigurable multiplier, a reconfigurable adder and an accumulation module.

The proposed architecture is composed mainly of three parts: a reconfigurable multiplication unit, a reconfigurable addition unit and an accumulation unit. The first two components are combined into a reconfigurable MAC unit, but they can function totally independently of each other. This ability is dependent on two configuration

bits. The processor that will be presented here can operate on two 64-bit items, or eight 32-bit items, or thirty-two 16-bit items, or one hundred and twenty-eight 8-bit items, or five hundred and twelve 4-bit items, with items in unsigned, signed, or 2’s complement representation.

Reconfiguration can be done dynamically using a few control bits. There are many possible combinations of the reconfiguration parameters, yielding a number of different modes of operation.

The proposed architecture can be reconfigured with respect to the following parameters: (i) the bitwidth of the operands, (ii) the arithmetic system of computations (unsigned, sign-magnitude and 2’s complement) and (iii) various throughput rates. More specifically, the main characteristics of the proposed reconfigurable unit are:

(i) It can compute the array of results for any item precision b (here b = 4–64 bits). The multiplication part is implemented through the application of a recursive decomposition of a partial product matrix, repeated use of small m × m (m = 4) multipliers and small adder circuit blocks. The accumulation part is comprised of an adder, which is implemented by using combined blocks of carry-save adders and basic addition properties, and an accumulation module.
(ii) It includes the appropriate circuitry to transfer the input data to the MAC and ensure that the results will exit correctly. In this approach we assume a 32-bit bus.
(iii) With the addition of a few multiplexers and the appropriate reconfiguration bits, part of the architecture may be disabled according to the application, saving power because of the idle state of the unused part.
(iv) It allows different arithmetic representations (sign-magnitude, unsigned or 2’s complement) [9].
(v) An appropriate number of pipeline stages can be bypassed through multiplexers, trading throughput rate for power consumption whenever input data rates are low. This technique achieves high performance and very low energy dissipation by adapting its structure to computational requirements over time [10].

Apart from the attractive hardware features, the proposed reconfigurable architecture exhibits additional characteristics: (a) Low-power consumption: depending on the MAC specifications (e.g., wordlength), parts of the MAC architecture can be powered down dynamically and (b) Reusable IP blocks: depending on the MAC functional specifications, in case for instance a certain design parameter (e.g., arithmetic representation) is fixed, while another one (such as wordlength) is not, the designer can instantiate in an automatic fashion the components in the RTL VHDL code that are related to the desired functionality. Consequently, the user can realize alternative reconfigurable modules with the precise flexibility required by the application constraints.

The proposed architecture was described in VHDL and, strictly for demonstration and evaluation purposes, was implemented in FPGA devices in order to perform comparisons with existing architectures in terms of performance, area and power consumption.

A similar architecture was proposed in [8], where inner product computation was also used for constructing the reconfigurable multiplier unit, but the proposed work is focused on the ability to differentiate its throughput rate and disable part of the architecture for reasons of power saving. The design proposed in this paper is regular, it can be easily pipelined, and most parts of the network are symmetric and repeatable. There is also another important difference from [8]: the operands can be of different arithmetic representation (integers or double precision 2’s complement numbers, signed or unsigned), although the reconfigurable multiplier is designed for operating with unsigned digit numbers.

Furthermore, a reconfigurable multiplication unit with similarities to the proposed multiplication module was introduced in [11], but it offers less flexibility.

The remaining part of the paper is organized as follows: Section 2 provides a detailed description of the MAC architecture and its components, Section 3 presents the obtained experimental results and the paper concludes with Section 4.

2. Reconfigurable multiply-accumulator architecture

2.1. General overview

The architecture of the proposed reconfigurable multiplication accumulation component (MAC) is shown in Fig. 1. It consists of: (i) the interface unit (IU), (ii) the arithmetic selection unit (ASU), (iii) the multiplication unit (MU), (iv) the addition unit (AU), (v) the accumulation unit (AcU) and (vi) the reconfiguration logic (RL). In particular, the first two units each include two similar logic modules, which handle the incoming operands (for multiplication and addition) and the outgoing stream (i.e., the final result), and perform the data transfers from/to the bus. The ASU also includes a third module for internal data conversion. Generally, the bus bitwidth and the input/output data wordlength are different and thus, a control logic circuit within the IU is designed to tackle this issue. The ASU performs conversion between different arithmetic representations (i.e., 2’s complement and unsigned numbers). The third unit, MU, performs multiplication for various wordlengths of input data and its architecture crucially affects the overall latency, power consumption and area. Here, we provide the MU design for multiplications with multiplicand and multiplier of 64 bits. Internally, it uses as basic module a 4 × 4 array multiplier [9], assuming unsigned digit operations. The next unit is the AU, which performs addition for various numbers of summands and wordlengths. The AU is designed for summands of 8, 16, 32 and 64 bits. The whole structure is based on a carry-save architecture of multiple summands for increased performance. The fifth unit performs a single

accumulation of two summands. Finally, the RL sends to the IU, ASU, MU, AU and the multiplexers and demultiplexers the appropriate bitstream to perform the reconfiguration and the appropriate signals for the corresponding operation. To realize the various modes of configuration, we use eight control signals, which are internally decoded to 18 control signals (recon1, …, recon14, rec1, rec2 and rec3; see Section 2.7). The detailed architecture description and function of the proposed MAC unit are provided in the following sections.

Fig. 1. Architecture of the MAC.

In addition, the proposed module exhibits low-power features by disabling and bypassing a subset of registers based on the specified throughput requirements and the anticipated computational load. This information is application-dependent and can be inferred at high abstraction levels. It can also be combined with voltage scaling to further increase energy efficiency.

2.1.1. Description of the MAC operation

The proposed structure of the MAC can support four different functions. The first one is multiplication-accumulation and the remaining ones are derived from the independent operation of the units that compose the MAC.

The first function (multiplication-accumulation) uses all the available units (Fig. 1). The input data pass from the input IU, which distributes them to the following unit in the appropriate way that will be explained later. The following unit, the ASU, converts the representation of the numbers to an unsigned form in order to be multiplied at the MU after they pass demultiplexers (DeMUX) 1 and 2. After the multiplication, the products pass through DeMUX3 and end at the intermediate ASU, which converts them to 2’s complement representation if the original data had a signed or a 2’s complement representation (after their accumulation they will take their original form). This conversion occurs because the addition of numbers in 2’s complement representation (although they contain their sign) is the same as that of the unsigned representation. The next step is the product accumulation. More specifically, since more than three products are available simultaneously and should be accumulated, the accumulation occurs in two phases for performance reasons. First, the products are added in the AU and the result is accumulated in the AcU with the sum from the previous time step. In the case of a 64 × 64 multiplication, where only one product is computed in each clock cycle, the AU is not used. The products after the ASU, for internal data, are accumulated in the AcU after passing through the right output of DeMUX5 and MUX5. After that, the sum passes through MUX3 and MUX4 and enters the ASU for outgoing data, where its representation is changed back to that of the incoming data.

The second operation of the proposed reconfigurable structure is multiplication. The flow of the operation is similar to the previous one up to the output of the MU, where the products pass through the right output of DeMUX3 and through MUX2 and MUX4, and reach the output ASU module.

The third operation is addition. The incoming data pass through the IU, ASU, DeMUX1 and MUX1 to end up in the AU. The ASU converts only the signed numbers to 2’s complement. The sum of the AU may or may not be accumulated at the AcU and exits after it has been converted back (if the original data were signed numbers) in the ASU.

The last operation of the reconfigurable MAC is the use of the ASU modules in order to change the representation of the digital input data. To bypass the main MAC structure the data go through DeMUX1, DeMUX2, MUX2 and MUX4.

2.2. Multiplication unit

2.2.1. Basic concept

The unsigned multiplication can be expressed by:

$$\sum_{i=0}^{15} P_i 2^{i} = \sum_{0 \le i,j \le 7} A_i B_j 2^{i+j} = \sum_{0 \le m,n \le 1} \left( \sum_{\substack{4m \le i \le 3+4m \\ 4n \le j \le 3+4n}} A_i B_j 2^{i+j} \right), \tag{1}$$

Fig. 2. An 8 × 8 product decomposed to 4 × 4 partial products.
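The decomposition in Eq. (1) and Fig. 2 can be checked numerically. The following is a minimal Python sketch, not the authors' VHDL: the function names `mul4x4` and `mul8x8` are hypothetical, and plain integer addition stands in for the 4-bit adder and the three-operand carry-save adder of the hardware. Each operand is split into two 4-bit digits, the four 4 × 4 products are formed, and each product is weighted by 2^(4(m+n)) before summation.

```python
def mul4x4(a: int, b: int) -> int:
    """Behavioural model of a base 4 x 4 unsigned multiplier (8-bit product)."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a * b


def mul8x8(a: int, b: int) -> int:
    """8 x 8 unsigned product assembled from four 4 x 4 partial products.

    The digit indices m and n in {0, 1} select the low or high 4-bit half
    of each operand, as in the inner sum of Eq. (1).
    """
    assert 0 <= a < 256 and 0 <= b < 256
    digits_a = [a & 0xF, a >> 4]  # low and high 4-bit halves of A
    digits_b = [b & 0xF, b >> 4]  # low and high 4-bit halves of B
    product = 0
    for m in range(2):
        for n in range(2):
            # Partial product of digit m of A and digit n of B, weight 2^(4(m+n)).
            product += mul4x4(digits_a[m], digits_b[n]) << (4 * (m + n))
    return product
```

Applying the same split recursively (with `mul8x8` as the base block) would give a software analogue of the larger multipliers described in Section 2.2.3.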

where A and B are two 8-bit numbers, A = A7, …, Ai, …, A0 and B = B7, …, Bj, …, B0. Eq. (1) implies that an 8 × 8 multiplication is equivalent to four 4 × 4 multiplications, where m and n are integers ∈ {0, 1}.

We build partial products of 4 × 4 matrices, which compose an 8 × 8 partial product matrix as shown in Fig. 2. The weighted bits of the four products of the four multipliers are added by two adders to give the final product of the 8 × 8 multiplier.

2.2.2. Construction of an 8 × 8 multiplier (Block 8 × 8)

Fig. 3 depicts the architecture of an 8 × 8 multiplier. The four products of the 4 × 4 multipliers pass through an array of demultiplexers controlled by the recon1 bit. If recon1 is set low, a Block 8 × 8 produces four 8-bit products, while if recon1 = ‘1’ two adders sum the four products, deriving a 16-bit product. The three-operand 8-bit carry-save adder consists of an array of full adders (compressing the three input bits to two [9]) and half adders, and a ripple carry adder. We sought a fast three-operand adder and chose a carry-save adder, which, for multiple operands, is faster and more area-efficient than a carry-lookahead adder [9]. The two registers provide the ability to complete the 8 × 8 multiplication in two pipeline stages. The two multiplexers, controlled by recon7, can disable the registers, thus reducing the pipeline stages.

2.2.3. Designing a multiplier with larger input size

The approach (conceptually) described above for the decomposition of an 8 × 8 partial product matrix into four 4 × 4 ones can be applied recursively for larger operands. Fig. 4 shows the general scheme of the m·2^k × m·2^k multiplier’s structure. The m·2^k parameter, for k ≥ 1 (k ∈ N), defines the type of the blocks that are used and the multiplier itself. The m parameter defines the smallest multiplier and the k parameter defines the number of the extension that will be made. The recon1 bits control the kind of the multiplication that is performed and the pipeline stages of the blocks, while recon2 defines the kind of the multiplication of the whole component and whether there is another pipeline stage or not (recon3). Each m·2^k × m·2^k multiplier includes an output register/latch of m·2^(k+1) bits and an m·2^(k+2)-to-m·2^(k+1) MUX.

The structure of Fig. 4 can multiply two operands of maximum m·2^k bitwidth. If we would like to perform m·2^(k−1) × m·2^(k−1) multiplications, the m·2^(k−1) × m·2^(k−1) Blocks would be used. The four of them can multiply four pairs of multiplicands. We have the choice of multiplying fewer pairs of operands, i.e., only two pairs of operands, by setting two of the four Blocks in the idle state. The IU (Section 2.3) does not transfer data to these two blocks. Thus, only half of the structure is used in order to lower the power dissipation of the unit.

For example, if we use a 4 × 4 multiplier (m = 4) as the smallest multiplier component, Fig. 4 takes the form of the circuit of Fig. 3 (Block 8 × 8). Block 8 × 8 (m = 4, k = 1) differs from the others in two ways. It has two and not three kinds of outputs, and it is comprised of four standard multipliers and not of reconfigurable Blocks. The rest of the Blocks (for k > 1) follow the architecture of Fig. 4. Thus, to build a 16 × 16 reconfigurable multiplier (m = 4, k = 2), four components of Block 8 × 8, one 3-input 16-bit carry-save adder, one 8-bit ripple carry adder, one 32-bit register and a few additional multiplexers and demultiplexers (controlled by the recon1, recon2 and recon3 bits) are needed. It is easy to verify that the architecture can produce the product of two 16-bit numbers by setting recon1 = 1 (of Block 8 × 8, Fig. 3) and recon2 = 1 (of multiplier 16 × 16, Fig. 4); or the array of four 16-bit products by setting recon1 = 1, recon2 = 0; or the array of 16 8-bit products by setting recon1 = recon2 = 0. In our case the first extension of Block 8 × 8 was a structure similar to that of Fig. 4 with k = 2, a 16 × 16 reconfigurable multiplier like the one described previously. The other two extensions will be for k = 3 and 4, a 32 × 32 and a 64 × 64 reconfigurable multiplier, respectively. This will happen if we use the proposed 16 × 16 reconfigurable multiplier as Block 16 × 16 for building the 32 × 32 reconfigurable multiplier and the proposed 32 × 32 reconfigurable

Fig. 3. The reconfigurable 8 × 8 multiplier (Block 8 × 8).

multiplier as Block 32 × 32 for building the 64 × 64 reconfigurable multiplier, respectively. The final reconfigurable multiplier (64 × 64) can produce an array of 256 products of 8-bit items in one pipeline cycle; an array of 64 products of 16-bit items in two pipeline cycles; an array of 16 products of 32 bits in three pipeline cycles; an array of four products of 64 bits in four pipeline cycles; and a 128-bit product in five pipeline cycles. In general, a reconfigurable m·2^k × m·2^k multiplier based on m × m multipliers can execute either 4^0 m·2^k × m·2^k multiplications, or 4^1 m·2^(k−1) × m·2^(k−1) multiplications, or 4^2 m·2^(k−2) × m·2^(k−2) multiplications, or 4^3 m·2^(k−3) × m·2^(k−3) multiplications, …, or 4^(k−1) m·2^1 × m·2^1 multiplications, or 4^k m × m multiplications.

2.2.4. The duplication and distribution networks

To duplicate and distribute the input data stream to the array of multipliers, we need the following two additional simple subnetworks shown in Fig. 5 (for the 16 × 16 reconfigurable multiplier): (1) The input duplication network with the reconfiguration switches, as shown in Fig. 5(a) and (b). It duplicates data received from ports in one of the three levels according to the reconfiguration options and consists of a fixed wire net and an array of reconfigurable switches with three switch states shown in Fig. 5(b). (2) The duplicated input data distribution network, as shown in Fig. 5(c), which is a fixed wire net that permutes data to the base 4 × 4 multipliers. By connecting these two sub-networks, we have the complete input network. Once the inputs are duplicated and distributed to the array of base multipliers, the corresponding levels of reconfigurable modules described in the previous sections are able to perform the pre-selected computation in pipeline to yield the desired results. In Fig. 5(c) the 4-bit and 8-bit items have number indicators. If we want to multiply two 4-bit numbers we place them in the right switches. There is a simple algorithm that finds which the appropriate switches are, e.g., we place X1, Y1 to switches 17, 1, respectively. The switches are numbered from the right to the left of the page (Fig. 5). The remaining assignments are listed in Table 1.

Similarly, the network for a reconfigurable array of multipliers with s = 64 (s = maximum input item bitwidth) and m = 4 can be constructed, with a total of 512 5-state switches connected to 256 4 × 4 multipliers.

2.3. Interface unit

2.3.1. Main part

The IU distributes an input array pair with h = (s/b)^2 b-bit items in parallel into the MAC and transfers the results out of it. It comprises two modules, one for the incoming data (input IU) and the other for the outgoing data (output IU). The whole idea is based on correctly synchronized multiplexers and trades transfer cycles (the

Fig. 4. The reconfigurable 2^k m × 2^k m multiplier (Block 2^k m × 2^k m).

Fig. 5. (a)–(c) The input duplication–distribution sub-networks for the 16 × 16 reconfigurable multiplier.

time needed for a single 32-bit word to be transferred from/to the bus) for width of registers. In this approach, we assume a 32-bit input/output bus.

The two basic components for the implementation of the input IU are Blocks 1 and 2. Block 1 is depicted in Fig. 6. rclk_1 is the complement of the clock signal Clk, which the bus uses to transfer the data. clk_2 is a signal of twice the period of clk and clk_3 is a signal of twice the period of clk_2. The two 32-bit registers are provided with new data by the demultiplexer only when they are triggered by the AND gates. This means that the 32-bit words that enter the component are loaded sequentially on the registers, and every two clock cycles the 64-bit register is loaded. Thus, two sequential numbers transferred from the bus form a single 64-bit number. The blk_select signal enables this block (Block 1), in a way that will be explained later. Block 2 has a similar operation, but it uses the clk_2 instead of the clk_3 signal and it does not use the blk_select signal at all.

In Fig. 7, the whole IU for incoming data is shown. The signal Clk 2 has twice the period of signal Clk, Clk 4 has twice the period of Clk 2 and so on. Rclk, Rclk 2, Rclk 4, Rclk 8, Rclk 16, Rclk 32 are the complementary signals of Clk, Clk 2, Clk 4, Clk 8, Clk 16, Clk 32, respectively. DeMUX2 distributes the data among the two Block 1 modules and is controlled by MUXa. The demultiplexer either provides each Block 1 module with two sequential data words, because Clk 4 is four times slower than Clk, or it provides only the right Block 1 with data. The signal of MUXa enables these blocks each time to accept or not accept data by triggering their blk_select input. A period of Clk 4 is the time needed to load the two block_64 with data.

The output of each Block 1 is distributed to two demultiplexers. Their input (a 64-bit word in each) is driven to the multiplier and/or to the next stage of the network. The next stage is similar to but simpler than the first. As we see, the components of this stage are twice the size of the ones in Block 1. Block 32 is similar to Block 2 with m = 64. Recon2 is the signal that determines if four pairs of 32-bit numbers will be multiplied or if the data continue in the same way as before, to the next level, for more pairs of numbers of fewer bits.

The reverse procedure is followed after the data are processed, so that they can be driven back to the bus. This procedure is implemented in the output IU. The new block that corresponds to Block 2 is Block a2, as illustrated in Fig. 8.

Table 1
Wiring of the distribution sub-network

Pairs X, Y   X1, Y1     X2, Y2     X3, Y3     X4, Y4
Switches     17, 1      18, 5      21, 2      22, 6
Pairs X, Y   X5, Y5     X6, Y6     X7, Y7     X8, Y8
Switches     19, 9      20, 13     23, 10     24, 14
Pairs X, Y   X9, Y9     X10, Y10   X11, Y11   X12, Y12
Switches     25, 3      26, 7      29, 4      30, 8
Pairs X, Y   X13, Y13   X14, Y14   X15, Y15   X16, Y16
Switches     27, 11     28, 15     31, 12     32, 16

2.3.2. Partial use of the structure-power reduction

With the addition of a few multiplexers and the appropriate reconfiguration bits, part of the structure can be disabled. A fraction of the processor’s capabilities is traded for flexibility and power consumption reduction because of the idle state of the unused part.

In this paper two examples are given for demonstration purposes: (i) In Fig. 7, MUXa controls DeMUX3. If Rec3 = ‘1’, DeMUX3 passes its input to the right Block 1. Therefore, half the circuit is not in use, leading to no switching activity of its signals, meaning power consumption reduction. MUXb and MUXc provide the new control signals (with half the period from before) for the register and the multiplexers and (ii) in Fig. 7, if Rec1 = ‘1’, the data from the bus pass to a demultiplexer.

Fig. 6. Block 1: first basic component of the interface unit for input data.
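The word-packing behaviour of Block 1 (Fig. 6), as described in Section 2.3.1, can be illustrated with a small behavioural sketch in Python. This is only an approximation under stated assumptions: the function name `block1` is hypothetical, clocking is abstracted to one list element per bus cycle, and the choice that the first word received becomes the most significant half is an assumption, since the text does not specify the ordering.

```python
from typing import Iterable, Iterator


def block1(words: Iterable[int], blk_select: bool = True) -> Iterator[int]:
    """Pack consecutive 32-bit bus words into 64-bit words.

    One input word models one bus clock cycle; a 64-bit word is produced
    every two cycles. When blk_select is False the block ignores the bus,
    which models the disabled (idle) state.
    """
    pending = None  # content of the first 32-bit register, if loaded
    for word in words:
        if not blk_select:
            continue
        assert 0 <= word < (1 << 32)
        if pending is None:
            pending = word  # first 32-bit register loaded
        else:
            # Second register loaded: the 64-bit register takes both halves.
            # Assumption: the first word forms the upper 32 bits.
            yield (pending << 32) | word
            pending = None
```

For instance, feeding four bus words yields two 64-bit words, while a deselected block yields nothing.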

Fig. 7. Interface unit.

This unit is controlled through the two bits of signal Rec2 and passes its data directly to the multiplication part, considering them as four pairs of 4-bit numbers (Rec2 = "00"), or two pairs of 8-bit numbers (Rec2 = "01"), or a pair of 16-bit numbers (Rec2 = "10"). In both examples a large part of the multiplier has been rendered inactive, leading to great power reduction if it is required.

2.4. Arithmetic selection unit

The ASU is comprised of three modules. The first is for the incoming data, the second for the internal and the third for the outgoing data. The easy conversion of a binary number to its corresponding 2’s complement one (by inverting it and adding ‘1’) and of a signed to an unsigned number (by removing and storing its sign) provides the opportunity to use different arithmetic representation systems on the same processor. This is possible simply by using the appropriate circuit for converting the data into a plain unsigned binary number before it enters the multiplier and a circuit that performs the opposite operation after the multiplication.

The above-mentioned circuit (for the incoming data) forms blocks that process a 128-bit input each, and one of these blocks is shown in Fig. 9. The signals recon1, recon2, recon3 and recon4 determine if this input is a group of thirty-two 4-bit items or sixteen 8-bit items or eight 16-bit items or four 32-bit items or two 64-bit items, respectively. Thus, if we choose to use unsigned numbers, the input ASU is bypassed by the first demultiplexer (both recon5 and recon6 should be kept low). If the input data are signed numbers, then they are converted to unsigned by setting their sign bit to ‘0’ (at the first group of demultiplexers) and the 32 sign bits are registered in a separate unit until the data are processed. These bits are available at the output of the input module for the rest of the ASU modules. In the case of 4-bit items, the 127th bit, 123rd bit, 119th bit, etc., of the register’s output, which are the sign bits of each 4-bit item, are set to ‘0’. Thus, these numbers are processed like unsigned ones. A similar procedure is followed for all the

other groups of items, setting each time the appropriate bits.

Fig. 8. Block a2: Basic component of the interface unit for output data.

The multiplexers that are used for setting signs are controlled by five single bits, which are derived from the combinational logic.

The signs of the numbers are registered and, at the appropriate time, examined through an XOR gate (at the output unit or at the intermediate unit): if they are different (one is positive and the other negative) the result is ‘1’ (negative number); otherwise (both positive ‘0’ or both negative ‘1’) the result is ‘0’ (positive number). The output of the XOR gate is connected to the

Input 128 bits


reset
Register 128 b its reset 1 bit
clock
clock 1 bit

Register output 1 bit


Mux. 32 Mux.31 Mux. 2 Mux. 1
(122 down to 120)
….. ….. 3 bit
‘0’

recon 6
Demultiplexer input 128 bits ch 1 b it
output 1 128 bits output 2 128 bits
128bits
128bits recon 5
...5 1 b it
...4 1 bit
ssrecon ...
...3
5 b its
...2
...1

NOT Gate output


(122 down to 120) "3 down to 0" bits
Mux. 32 Mux. 31 Mux. 2 Mux. 1

….. ….. 3 bit


LSB

Decoder input 128 bits input


128 bits output 128 bits 4 b its MSB
128 b its recon
1,2,3,4
input 1 128 bits input 2 128 bits
Adder output 128 bits

Mux. 32 Mux. 31 Mux. 2 Mux. 1

….. ….. 3 bit

128 b its

input 1 128 bits input 2 128 bits ch


Multiplexer output 128 bits
Sign bits
(32=>1) Output 128 bits

Fig. 9. Arithmetic selection unit for input data.



appropriate result bit which represents the sign of the number. The sign register module (SRM) of Fig. 1 performs the simple operation of registering the values of the signs before they are used at the output or the intermediate module.

If we want to use numbers in 2’s complement representation we set both recon5 = ‘1’ and recon6 = ‘1’. The items are inverted and ‘1’ is added to each of them (the addition is made to the appropriate word derived from the decoder according to the bitwidth of the operands; e.g., “00010001000100010001000100010001000100010001000100010001000100010001000100010001000100010001000100010001000100010001000100010001” for the 4-bit operands), and in this way they are modified to plain unsigned numbers. The inverse procedure modifies them back to the original representation system after the multiplication and addition, and it is accomplished in the output module.

The three modules of the ASU are based on the above procedures. The first one (for the input data) differs from the other two in one element. It cannot convert the processed numbers to signed representation because it can only remove the signs from the numbers, as can be seen in Fig. 9. The other two modules can perform exactly the opposite and their structure uses a number of XOR gates (as was mentioned above) in order to restore the sign of the numbers. This unit (ASU) is totally independent from the rest of the MAC; therefore it can be used with other modules, for example reconfigurable adders, or even be disabled altogether.

When the incoming data stream of the reconfigurable MAC consists of numbers in 2’s complement, its representation must be changed. First the input data ASU changes the 2’s complement representation into unsigned digit representation. Then the data are multiplied and the products enter the internal data ASU, in order to be converted to 2’s complement representation before they are added. If the original data were not in 2’s complement representation but in simple signed representation, then the flow of the processing would be the following: first they would be altered to unsigned data, second they would be multiplied, third they would be changed to 2’s complement representation, fourth they would be added and accumulated, and finally they would be changed into the original signed representation. In case we want to perform additions or accumulations, incoming data in 2’s complement form will be unchanged, while signed incoming data will be changed to 2’s complement representation before the addition.

2.5. Addition unit

2.5.1. Theoretical basis

This unit is based upon two fundamental properties of addition. Both are demonstrated using simple examples. In Fig. 10, two examples of a binary addition of four 4-bit numbers and an example of eight 4-bit numbers are shown. The maximum width of the results of these additions is 6- and 7-bit, respectively.

Fig. 10. (a), (b) Binary addition of four 4-bit numbers (the carry bits are shown), and (c) Binary addition of eight 4-bit numbers. The first four are those from example (a) and the last four from example (b). (Example (a): 0101 + 1110 + 0010 + 1000 = 011101; example (b): 1101 + 1011 + 0100 + 0000 = 011100; example (c): sum of all eight = 0111001.)

Fig. 11 shows an example of the first property of addition that is used in this paper (property 1). The addition of eight 4-bit numbers (the same as in Fig. 10(c)) is performed in two steps. The first step includes two additions, those of Fig. 10(a) and (b). The second step is the addition of the two results of the previous step, and their sum is the same as the one shown in the example of Fig. 10(c).

Fig. 11. The addition of eight numbers in two steps. (Step 1: Sum1 = 011101 and Sum2 = 011100; Step 2: Sum1 + Sum2 = 0111001.)

Two conclusions can be derived from this example. The first is that an addition can be split into a number of additions with fewer summands that are performed in parallel, and vice versa. The second conclusion is that the addition can in this way be performed in more than one step, leading to its easy pipelining.

In Fig. 12 the addition of the same four numbers as in Fig. 10(a) is performed in two steps (property 2). Each number is split in two parts (of maximum and minimum magnitude) and in the first step, the four 2-bit numbers of minimum and maximum magnitude are added (Fig. 12(a) and (b)). The two separate results are added in the second step, considering that there is a 2-bit magnitude difference between them. The final result is the same as in Fig. 10(a). Therefore the conclusion is, as in the previous property, that an addition can be split into/composed of a number of additions, with the same number of summands, but of a smaller word width. The same opportunity for pipelining is also met in this property.
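Property 1 can be checked numerically with the operands of Figs. 10 and 11. The following Python sketch is only an illustration and not part of the proposed hardware; the function name `add_in_groups` is hypothetical, and the operand values are those read from the figures.

```python
def add_in_groups(summands, group_size):
    """Property 1: add each group of summands separately, then add the partial sums."""
    partial_sums = [
        sum(summands[i:i + group_size])
        for i in range(0, len(summands), group_size)
    ]
    return partial_sums, sum(partial_sums)


# Operands of Fig. 10(a) and Fig. 10(b), as read from the figures.
example_a = [0b0101, 0b1110, 0b0010, 0b1000]
example_b = [0b1101, 0b1011, 0b0100, 0b0000]

partial_sums, total = add_in_groups(example_a + example_b, group_size=4)
print([format(p, "06b") for p in partial_sums])  # the two 6-bit results of step 1
print(format(total, "07b"))                      # the 7-bit result of step 2
```

Because the two group sums do not depend on each other, they can be computed in parallel or placed in separate pipeline stages, which is the point made in the text.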

Fig. 12. The addition of four 4-bit numbers in two steps. (Step 1: the two least significant bits of the four numbers add up to 0101 and the two most significant bits add up to 0110; Step 2: the two results are added with a 2-bit magnitude difference, giving 011101.)
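Property 2 can be illustrated in the same way. The Python sketch below splits each operand of Fig. 10(a) into a low and a high 2-bit part, adds the parts separately and realigns the high sum before the final addition. The function name `add_by_slices` is hypothetical and the code is a behavioural illustration, not a description of the circuit.

```python
def add_by_slices(summands, slice_bits):
    """Property 2: add the low and high parts separately, then realign and add."""
    mask = (1 << slice_bits) - 1
    low_sum = sum(x & mask for x in summands)          # parts of minimum magnitude
    high_sum = sum(x >> slice_bits for x in summands)  # parts of maximum magnitude
    total = (high_sum << slice_bits) + low_sum         # step 2: shifted addition
    return low_sum, high_sum, total


# Operands of Fig. 10(a), as read from the figure.
operands = [0b0101, 0b1110, 0b0010, 0b1000]
low_sum, high_sum, total = add_by_slices(operands, slice_bits=2)
print(format(low_sum, "04b"), format(high_sum, "04b"), format(total, "06b"))
```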

The adder that is presented here is based on a carry-save structure [9]. The carry-save addition follows a simple idea: (a) more than two summands are used, while the bits of the same magnitude are added simultaneously, (b) in each step, the least significant bit of the sum and a number of carry bits are calculated from the bits of the same magnitude (either the bits of the summands, or the carry bits of a previous step). This concept is explained thoroughly in Fig. 13 by using an example of an addition of four 4-bit numbers (the numbers of Fig. 10(a)). In Fig. 13 the carry, sum, and summand bits are symbolized with the letters c, s and sm, respectively. The circled sum bits comprise the final sum bits.

This example uses either 3- or 2-bit additions because in the hardware implementation, either full- or half-adders are used. The first step is to calculate the carry bits and the sum bit of the columns (Fig. 13 b1.1, b2.1, b3.1, b4.1). Each column consists of the bits of the summands which are of the same magnitude (Fig. 13(a)). In this step the first three bits (sm0_0, sm1_0 and sm2_0) are added and they produce a carry (c0_0) and a sum (s0_0) bit. Then, the other bit of the summands (sm3_0) and s0_0 are added (shaded part), producing a carry (c1_0) and a sum (s1_0) bit. As far as the first column is concerned, the calculations are completed and the sum bit (s1_0) is the least significant bit of the final sum. To calculate the second bit of the final sum, three more bits are added: the sum bit s1_1 of the second column and the carry bits c0_0, c1_0 of the first column (Fig. 13(b)). This action results in another carry (c2_1) and sum (s2_1) bit.

The same procedure continues for all the columns. Fig. 13(b) illustrates in detail which carry and sum bits are added and how many additions take place in order to calculate each and every one of the final sum bits.

2.5.2. Basic concept

We use blocks of carry-save adders, which operate as explained in Section 2.5.1 (Fig. 13). They are used either to increase the number of the summands or to enlarge the width of the summands (by using the first or the second property of addition, respectively).

The blocks of carry-save adders that are used here can add four 4-bit numbers. Still, their size can be modified according to our demands. In this paper, the only demand which we had to take into consideration is the fact that the MU can produce, simultaneously, at least four products that have to be added.

The architecture of each block is based on the structure illustrated in Fig. 13. Every 3-bit addition can be replaced with a full-adder and every 2-bit addition with a half-adder. The structure of the block is depicted in Fig. 14. Further, we have to notice that the carry-save block (Csv-Block) can be separated in two parts. The first one is the first four columns of full and half adders, and the second one is the fifth column of full and half adders and the OR gate. The first part depends on the number of the bits of the summands. The second part calculates the sum of the final carry bits that come up from the last column of the first part. If the summands had 8 instead of 4 bits, then the columns of the first part would be eight and the second part would differ accordingly.

In order to increase the number of the summands, we make use of property 1. Therefore, to construct an adder of eight 4-bit summands, two Csv-Blocks and an adder of two 6-bit numbers are used. Each Csv-Block adds four 4-bit numbers and their sums are added at the 6-bit adder. We can create bigger structures by using more Csv-Blocks or different Csv-Blocks (in summand width and number). Furthermore, if we want to add more summands, we can use more Csv-Blocks and replace the 6-bit adder with the appropriate unit. For example, in order to add sixteen 4-bit numbers, four Csv-Blocks are used, and the simple 6-bit adder is replaced with a carry-save adder that adds four 6-bit numbers.

The operand width can be enlarged easily by using property 2. Fig. 15 is an example of how to create an adder of four 8-bit numbers out of two Csv-Blocks and a 6-bit adder. We can notice that the 6-bit adder can have a simpler structure than a normal one, because it adds a 6-bit number and a 2-bit number. That is why we use only its six bits and not all seven. The seventh sum bit (the most significant bit) of the 6-bit adder can only be “0” because of the summands that are used (one has six bits, and the other two). Also, notice that only the two most significant bits of the first Csv-Block affect the six most significant bits of the final sum. This is due to the fact

that these are the overflow bits of sum_0. The general observation is that the least significant bits of the first sum (sum_0) are also the least significant bits of the final sum and only its overflow bits affect the most significant bits of the final sum, whatever the width of the summands is. For example, if the Csv-Block in Fig. 15 were a carry-save adder of eight 8-bit numbers (its output is an 11-bit number, three bits of which are the overflow), only the three most significant bits would be processed further, in order to calculate the 11 most significant bits of the 19-bit result.

Fig. 13. (a) Four 4-bit numbers (sm0 = 0101, sm1 = 1110, sm2 = 0010, sm3 = 1000) separated in four columns according to their magnitude, (b) a detailed example of a carry-save addition, (c) the final sum bits F5 to F0, taken from s3_5, s3_4, s3_3, s3_2, s2_1 and s1_0.

If we want to handle summands of greater width we can use the same method as in Fig. 15. For example, when we want to add four 16-bit numbers and not 8-bit numbers as in Fig. 15, the only thing that we have to do is use twice the structure of Fig. 15 as it was described earlier. The Csv-Blocks are replaced by components identical to the whole structure of Fig. 15 and the 6-bit adder is replaced with a 10-bit adder. The most significant bit of the 10-bit adder can only be “0” due to the summands used, i.e., one has ten bits, and the other two.

2.5.3. Operation of addition unit

The synthesis of the two structures that were described earlier, the one for increasing the number of the summands

and the one for increasing the wordlength of the summands, yields the reconfigurable AU. It is composed of two modules. The first module comprises identical units in which the main part of the additions takes place. They are totally independent from each other and that is why they have the ability to perform different kinds of additions at the same time. Their architecture is based on carry-save addition. The second module is designed in order to combine the outputs of the units of the first section according to the two addition properties that we explained earlier. Its structure differs according to the number of these identical units and their structure.

Fig. 14. The structure of the basic carry-save block (Csv-Block). It can add four 4-bit numbers and has a 6-bit output.

Fig. 15. An adder of four 8-bit summands, made of two Csv-Blocks and a simple adder of two 6-bit summands.

The AU can operate as a simple adder or can act as part of an accumulation operation of the output of the multiplication unit. Therefore it must be able to calculate four kinds of sums: either four 64-bit numbers, or sixteen 32-bit numbers, or sixty-four 16-bit numbers or two hundred and fifty-six 8-bit numbers.

We will explain the structure of the AU in two steps. The first one clarifies all the kinds of additions that this unit computes except the 256 8-bit number addition, which is explained in the second step.

The first part of the AU is depicted in Fig. 16. The main part of the additions takes place in the two ReconCsv-Blocks, RCB (part of the first section of the AU). These blocks are based on the carry-save addition and have a similar function to the Csv-Blocks of Fig. 14. The combination block, CB, is part of the second section of the AU that combines the outputs of the RCBs. The eight 10-bit outputs of the RCBs are used by the CB in order to compute the 64-bit addition. Similarly, the 12-bit and the 14-bit outputs are used for the 32-bit addition and the 16-bit, 8-bit additions, respectively. The two configuration bits select which type of addition will take place at the AU each time. Signal cb1 is one of these configuration bits. If we wish our architecture to operate in a power-save mode, we can set one of the RCBs to an idle state and perform operations using only one RCB, with reduced throughput.

Fig. 17 shows a ReconCsv-Block. Each Csv-Block is based on the architecture of a carry-save adder. Csv-Block 4-8bit (CsB 4-8) can add four 8-bit summands. It is similar to the Csv-Block of Fig. 14 but with three differences. The first one is that it can add 8-bit and not only 4-bit summands. Therefore, it has not four columns of full and

half adders as the Csv-Block but eight columns and the respective columns for the addition of the final carry bits. The second difference is that the half-adders of these eight columns have been replaced by full-adders, so that we can add one more bit in each column and alter the addition of four summands to an addition of five. The fifth summand is the result of a previous CsB 4-8 block that we want to use in order to accomplish an eight-summand addition instead of a four-summand one. With this modification we have a more fine-grain structure of less area and latency than the one that was described in Section 2.5.2, which employs a separate adder. The third difference is that demultiplexers, registers and multiplexers are used in order to alter the number of the pipeline stages as we explained in the multiplication unit. The first four columns of full-adders of CsB 4-8 (which process the four least significant bits of the summands) comprise the first pipeline stage and the other four columns of full-adders (which process the four most significant bits of the summands) comprise the second. The third stage is the part that adds the final carry bits of the Csv-Block. Two configuration bits (cb3,4) determine the number of the pipeline stages.

Fig. 16. The first part of the addition unit. (Two ReconCsv-Blocks (RCB), each with a 512-bit input, configuration bits cb1 to cb4, and four 12-bit, four 10-bit and one 14-bit outputs, feed the Combination Block (CB), which produces a 22-bit Output1, a 36-bit Output2 and a 66-bit Output3.)

Fig. 17. The ReconCsv-Block.

The Csv-Block 16-8bit (CsB 16-8) is similar to CsB 4-8 with the difference that it adds 16 summands instead of four. It is almost four times larger than CsB 4-8 and they are both based on exactly the same architecture concept. The Demultiplexer Block 1 (DB1) accepts the output of the CsB that precedes and transfers it either to the CsB that follows (in order to use it to compute an addition of more summands) or to the output of the RCB. The first three DB1s are controlled by configuration bit 1 (cb1). When it

is set to “0”, the output of each CsB 4-8 is transferred to the next CsB 4-8 and when it is set to “1”, the sum of four 8-bit numbers is available at the right output of the RCB. The last three DB1s are controlled by configuration bit 2 (cb2). When it is set to “0”, the output of each CsB 16-8 is transferred to the next CsB 16-8 and when it is set to “1”, the sum of sixteen 8-bit numbers is available at the middle output of the RCB. The sixth DB1, if cb2 = “0”, provides the left output of the RCB with the sum of sixty-four 8-bit summands. The DB2 is controlled by both cb1 and cb2. When cb1 = cb2 = “0” it transfers its input to the next CsB 16-8, when cb1 = “1” and cb2 = “0” it provides the right output of the RCB with the sum of four 8-bit summands and finally when cb1 = “0” and cb2 = “1” it provides the middle output of the RCB with the sum of sixteen 8-bit summands.

The combination block (CB) accepts the outputs of the two RCBs (except for the four 12-bit outputs of the left one) and computes the final result. The outputs correspond to the inputs from right to left (as shown in Fig. 16). Therefore, by using the two addition properties, which were explained earlier, the CB includes three structures that are depicted in Fig. 18. The architecture of Fig. 18(a) computes the sum of sixty-four 16-bit summands, the architecture of Fig. 18(b) computes the sum of sixteen 32-bit summands and the one of Fig. 18(c) computes the sum of four 64-bit summands. These three circuits are combined into one that is illustrated in Fig. 19. The multiplexers and the demultiplexers select which addition will take place. If cb1 = “0” we will have the addition of the four 64-bit numbers and if cb1 = “1” we will have in outputs 1 and 2 the result of the other two additions.

Fig. 18. The auxiliary circuits for: (a) a sixty-four 16-bit summands addition, (b) a sixteen 32-bit summands addition and (c) a four 64-bit summands addition. (In (a) two 14-bit inputs pass through a 14-bit adder to give the 22-bit Output1; in (b) four 12-bit inputs pass through two 12-bit adders and a 20-bit adder to give the 36-bit Output2; in (c) eight 10-bit inputs pass through four 10-bit adders, two 18-bit adders and a 34-bit adder to give the 66-bit Output3.)
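The circuits of Fig. 18 apply property 2 repeatedly: the low bits of each partial sum pass straight to the output, and only its overflow bits are added to the neighbouring, more significant partial sum. The Python sketch below models this behaviour arithmetically. It is an illustration under stated assumptions: each CB input is taken to be the sum of one 8-bit slice over all summands, the number of slices is a power of two, and the names `slice_sums` and `combine` are hypothetical. Multiplexers, demultiplexers and pipeline registers are not modelled.

```python
import random

SLICE_BITS = 8  # width of the slices summed by the carry-save blocks


def slice_sums(summands, width):
    """Sum each 8-bit slice of the summands separately (the role of the RCB outputs)."""
    mask = (1 << SLICE_BITS) - 1
    return [
        sum((x >> shift) & mask for x in summands)
        for shift in range(0, width, SLICE_BITS)
    ]


def combine(partial_sums):
    """Merge partial sums pairwise, least significant first.

    The low bits of the lower sum pass through unchanged; its overflow bits
    are added to the higher sum by the adder of that stage.
    """
    weight_bits = SLICE_BITS
    while len(partial_sums) > 1:
        mask = (1 << weight_bits) - 1
        merged = []
        for low, high in zip(partial_sums[0::2], partial_sums[1::2]):
            upper = high + (low >> weight_bits)
            merged.append((upper << weight_bits) | (low & mask))
        partial_sums = merged
        weight_bits *= 2
    return partial_sums[0]


random.seed(0)
summands = [random.getrandbits(64) for _ in range(4)]
partials = slice_sums(summands, width=64)
result = combine(partials)
print(max(p.bit_length() for p in partials))  # each slice sum fits in 10 bits
print(result == sum(summands))
```

For four 64-bit summands the eight slice sums need at most 10 bits each and the merged result at most 66 bits, in agreement with the widths shown in Fig. 18(c).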

Fig. 19. The Combination Block (CB).
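The RCBs whose outputs the CB merges are built from carry-save blocks (Section 2.5.1). As a reminder of that principle, the Python sketch below reduces several summands with word-level 3:2 compression (one full-adder per bit position) and performs a single carry-propagating addition at the end. It is a behavioural illustration under that assumption: it does not reproduce the exact column-by-column arrangement of full- and half-adders of Figs. 13 and 14, and the function names are hypothetical.

```python
def compress_3_to_2(a, b, c):
    """One layer of full-adders: three words in, a sum word and a carry word out."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def carry_save_add(summands):
    """Reduce the summands three at a time, then add the last two words."""
    terms = list(summands)
    while len(terms) > 2:
        sum_word, carry_word = compress_3_to_2(*terms[:3])
        terms = [sum_word, carry_word] + terms[3:]
    return sum(terms)  # the only carry-propagating addition


# Operands of Fig. 10(a), as read from the figure.
total = carry_save_add([0b0101, 0b1110, 0b0010, 0b1000])
print(format(total, "06b"))
```

No carry ripples across bit positions inside `compress_3_to_2`, which is why the delay of each reduction layer does not depend on the operand width.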

We will now explain how the AU computes the sum of the 256 8-bit summands. The main idea is to combine the sums of four groups of sixty-four 8-bit numbers. There are three implementation options. The first one is to use the two 14-bit outputs from the RCBs of Fig. 16 and another two similar outputs from two Csv-Blocks 64-8. These Csv-Blocks are simple carry-save structures that add sixty-four 8-bit numbers. The four 14-bit outputs are combined with a circuit of three carry-lookahead adders. Two of them add two 14-bit numbers each, and their sums are added at a 15-bit adder, yielding the final sum of the 256 numbers. The second implementation option is to perform this addition in two cycles. This means that we will use only the two RCBs of Fig. 16 and the circuit of Fig. 20, but twice. The circuit in Fig. 20 practically accumulates the outputs of the RCBs before they reach the AcU (notice that the 16th bit of adders 1 and 2 is not used because they both add a 14-bit and a 15-bit number). The third implementation option is to use only the two RCBs of Fig. 16 and a 16-bit adder. Each of them computes the sum of sixty-four 8-bit numbers. If we add their outputs in a 14-bit adder we will have the sum of a hundred and twenty-eight 8-bit numbers. This sum will be transferred to the AcU where it will be accumulated with the next or previous result of a hundred and twenty-eight 8-bit numbers addition. All three options have been considered, obtaining three alternative implementations.

Fig. 20. The auxiliary circuit for the second implementation of the second part of AU.

In this way we accomplished the accumulation of 256 8-bit products that we intended from the beginning, practically by using the AcU twice and by increasing the number of the pipeline stages. Each implementation has obvious advantages and drawbacks.

The previous description of the AU was made under the condition of a certain MU of m = 4, k = 4. The general structure of an AU is based on the same methodology. The 2^k·m for k ≥ 1 (k ∈ N) parameter defines the type of the RCBs or of the entire AU. The m parameter defines the smallest wordlength and the k parameter defines the

biggest wordlength of the summands that will be added and their plurality as well. The size of the constituent CsBs can differ according to the application demands. Here we used CsBs with an 8-bit width (m = 4) and of four or 16 summands. Generally, a reconfigurable structure, similar to the AU, of 2m wordlength and 4^k summands can execute an addition of 4^k 2m-bit numbers, of 4^(k−1) 2^2·m-bit numbers, …, of 4^2 2^(k−1)·m-bit numbers and of 4 2^k·m-bit numbers.

2.5.4. The duplication and distribution networks

To duplicate and distribute the input data stream to the AU we need similar networks to those of the MU that were described in Section 2.2.4. The only difference is that in this case there will be four-state duplication switches because of the four different types of inputs.

2.6. Accumulation unit (AcU)

The AcU is a simple circuit that is shown in Fig. 21. It comprises a two-summand adder and a register. The register holds the latest sum of the adder and transfers it to the adder to add it to the next available input. In our structure we chose a carry-lookahead adder because it is one of the fastest architectures. The maximum sum that exits from the AU (and must be accumulated) has a width of 66 bits and that would have been the wordlength of the summands of the adder. But the adder that is used adds two numbers of 128 bits because this is the widest product, in number of bits. This product does not pass through the AU because it is produced once in each time step.

Fig. 21. The accumulation unit.

2.7. Reconfiguration logic (RL)

The RL receives as inputs the main clock signal, the reset signal, and 10 internal control signals and performs two main functions: (i) internal clocks generation and (ii) decoding of control signals. The input 8-bit reconfiguration stream is decoded to 18 control bits whose roles are: (i) recon1, recon2, recon3, recon4 for controlling the bitwidth; (ii) recon5 for signed or unsigned number representation; (iii) recon6 for 2’s complement number representation (recon5 ignored); (iv) recon7, recon8, recon9, recon10 for regulating the pipeline stages; (v) recon11, recon12, recon13, recon14 for selecting the function of the whole structure (addition, multiplication, multiplication-accumulation and simple change of number representation); (vi) rec1, rec2 (2 bits) for enabling minimum configuration (a 32-bit product, or two 16-bit products or four 8-bit products); and (vii) rec3 for disabling half of the structure (in case of less than half of maximum throughput).

Depending on the application specifications, the proposed multiplier can operate in many different modes of reconfiguration, by choosing the suitable combination of the internal control signals.

3. Experimental results

For experimental purposes the units of the proposed MAC were described in VHDL and mapped into a number of Xilinx FPGA devices. This does not mean that the developed coarse-grain architecture is meant for FPGA fine-grain implementations; we simply used an FPGA prototype to evaluate its performance. We performed measurements considering a 64 × 64 reconfigurable multiplier. Specifically, the Input IU consists of 6150 CLBs, and its total equivalent gate count is 122,277. The corresponding measurements for the Output IU are 5463 CLBs and 105,432 gates. Each ASU consists of 16 identical blocks, each of which includes 334 CLBs or a 5719 gate count. The RL consists of 463 CLBs and 9432 gates. Apparently, the most hardware-complex unit is the MU. Below, we provide a number of measurements concerning this part.

An extensive comparative study of the proposed MAC unit, the multiplication unit and the addition unit with corresponding non-reconfigurable units was performed and the results are shown in Tables 2, 3 and 4.

In particular, Table 2 provides comparison results between the proposed reconfigurable MAC and a non-reconfigurable one. Both architectures were implemented on the same Xilinx VirtexII FPGA device. Mega-operations per second (MOPS) is defined as the number of operations multiplied by the maximum clock frequency divided by the operation latency in number of cycles. It can be seen that the reconfigurable MAC has a higher area complexity and a lower maximum frequency than the conventional MAC. However, the proposed MAC can manipulate various numbers of operands (i.e., 8, 32, 128 and 512), in contrast to the fixed number 2 of the conventional one. Therefore, the proposed MAC exhibits extremely high MOPS in comparison to the conventional one. It is obvious that our architecture is extremely efficient for a big number of operands and, as far as the area coverage is concerned, it is comparable with the non-reconfigurable MAC if all the functions that our architecture supports are required. For example, if the proposed MAC is configured to a wordlength of 8 bits, it can perform 64 operations at a hardware cost of 130,729

Table 2
Comparison between the proposed reconfigurable MAC and a non-reconfigurable MAC

                        Reconfigurable structure                Non-reconfigurable MAC
Operand bitwidth        4         8         16        32        4         8         16       32
# slices                6716 (1st implementation)               10        26        50       147
                        6013 (2nd implementation)
# eq. gates             130,729 (1st implementation)            4310      4598      5174     19,400
                        112,123 (2nd implementation)
Clock frequency (MHz)   16.935    16.152    13.996    12.591    154.488   126.167   89.710   57.541
# operands              512       128       32        8         2         2         2        2
MOPS                    4335.36   1033.73   223.94    50.36     154.49    126.17    89.71    57.54
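The MOPS row of Table 2 can be related to the definition given in Section 3 if one operation is counted per pair of operands and the latency is taken as one cycle. Both are assumptions inferred from the tabulated numbers, not statements made in the text. The Python sketch below recomputes the row for the reconfigurable structure from the clock frequencies and operand counts of the table; the function name `mops` is hypothetical.

```python
def mops(operations, clock_mhz, latency_cycles=1):
    """MOPS = operations * maximum clock frequency / latency in cycles."""
    return operations * clock_mhz / latency_cycles


# Operand bitwidth -> (clock frequency in MHz, number of operands), from Table 2.
reconfigurable = {
    4: (16.935, 512),
    8: (16.152, 128),
    16: (13.996, 32),
    32: (12.591, 8),
}

computed = {}
for bitwidth, (clock_mhz, operands) in reconfigurable.items():
    operations = operands // 2  # assumption: one operation per operand pair
    computed[bitwidth] = round(mops(operations, clock_mhz), 2)
    print(bitwidth, computed[bitwidth])
```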

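Table 3
Hardware complexity (area) comparison of a dedicated multiplier and the proposed multiplication unit

| # bits | Dedicated multiplier: CLBs | Dedicated multiplier: # eq. gates | Dedicated multiplier: MOPS | Multiplication unit: CLBs | Multiplication unit: # eq. gates | Multiplication unit: MOPS |
|---|---|---|---|---|---|---|
| 8 × 8 | 53 | 1232 | 1 × 34.4 | 112 | 1823 | 64 × 54/2 |
| 16 × 16 | 208 | 4648 | 1 × 18.7 | 638 | 9105 | 16 × 54/3 |
| 32 × 32 | 811 | 17,767 | 1 × 16 | 2713 | 39,245 | 4 × 54/4 |
| 64 × 64 | 3268 | 69,609 | 1 × 13.1 | 13,501 | 202,609 | 1 × 54/5 |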
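The MOPS entries of Table 3 are written as unevaluated expressions. Assuming they follow the MOPS definition given in the text (parallel products × clock frequency in MHz / latency in cycles, a reading that the table itself does not spell out), they can be evaluated with a few lines of Python:

```python
# Table 3 MOPS entries read as (parallel products, clock MHz, latency cycles).
# This interpretation is an assumption based on the MOPS definition in the text;
# a latency of one cycle is assumed for the dedicated multipliers.
dedicated = {8: (1, 34.4, 1), 16: (1, 18.7, 1), 32: (1, 16.0, 1), 64: (1, 13.1, 1)}
multiplication_unit = {8: (64, 54.0, 2), 16: (16, 54.0, 3), 32: (4, 54.0, 4), 64: (1, 54.0, 5)}

def evaluate(entries):
    """Evaluate products x frequency / latency for every operand size."""
    return {bits: n * f / lat for bits, (n, f, lat) in entries.items()}

dedicated_mops = evaluate(dedicated)
mu_mops = evaluate(multiplication_unit)

# Ratio of multiplication-unit throughput to dedicated-multiplier throughput.
ratio = {bits: mu_mops[bits] / dedicated_mops[bits] for bits in dedicated}

if __name__ == "__main__":
    for bits in sorted(ratio):
        print(f"{bits} x {bits}: MU {mu_mops[bits]:.1f}, "
              f"dedicated {dedicated_mops[bits]:.1f}, ratio {ratio[bits]:.2f}")
```

Read this way, the advantage of the multiplication unit grows as the operand size shrinks, because more sub-multipliers work in parallel, while for a single 64 × 64 product the dedicated multiplier remains slightly ahead.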

Table 4
Hardware complexity (area)-frequency comparison between carry-save adders and the proposed reconfigurable structure (RS)

| # summands–bitwidth/summand | RS: # slices | RS: # eq. gates | RS: clock frequency (MHz) | Conventional carry-save adder: # slices | Conventional carry-save adder: # eq. gates | Conventional carry-save adder: clock frequency (MHz) |
|---|---|---|---|---|---|---|
| 4–8 | 37 | 615 | 138.543 | 37 | 615 | 138.543 |
| 64–8 | 1010 | 13,113 | 39.858 | 864 | 10,666 | 53.291 |
| 32–16 | 867 | 10,818 | 53.285 | 843 | 10,578 | 58.792 |
| 256–8 (1st implementation) | 3690 | 40,217 | 24.675 | 3632 | 38,731 | 32.475 |
| 256–8 (2nd implementation, 2 pipeline cycles) | 1959 | 22,624 | 27.424 | | | |
| 64–16 | 1809 | 22,709 | 30.450 | 1683 | 20,865 | 46.013 |
| 16–32 | 822 | 10,626 | 55.661 | 788 | 10,182 | 70.279 |
| 4–64 | 343 | 5510 | 59.102 | 289 | 4871 | 64.470 |
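The aggregate area that the accompanying discussion quotes for the four conventional carry-save adders (6392 slices and 74,649 equivalent gates) can be reproduced from Table 4. The sketch below assumes that these four adders are the 256–8, 64–16, 16–32 and 4–64 configurations; this selection is inferred from the totals and is not stated in the source.

```python
# (slices, equivalent gates) of conventional carry-save adders, from Table 4.
# Assumption: these four configurations are the ones the text aggregates.
conventional_csa = {
    "256-8": (3632, 38731),
    "64-16": (1683, 20865),
    "16-32": (788, 10182),
    "4-64": (289, 4871),
}

total_slices = sum(slices for slices, _ in conventional_csa.values())
total_gates = sum(gates for _, gates in conventional_csa.values())

# Reconfigurable addition unit (AU): (slices, equivalent gates) quoted in the text.
au_first = (5190, 60820)
au_second = (3487, 42523)

def saving(au_value, conventional_total):
    """Relative area saving of the AU over the set of dedicated adders."""
    return 1.0 - au_value / conventional_total

if __name__ == "__main__":
    print(total_slices, total_gates)
    print(f"1st implementation: {saving(au_first[0], total_slices):.1%} fewer slices")
    print(f"2nd implementation: {saving(au_second[0], total_slices):.1%} fewer slices")
```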

equivalent gates, whereas conventional MACs would require an area of 64 × 4598 gates.

Table 3 shows comparison results between a dedicated array multiplier and the MU for various input operand bitwidths in terms of # CLBs, # eq. gates and MOPS. In terms of area, a standard multiplier is thus better than a reconfigurable one. However, taking into account the throughput rate (number of results per cycle), the conclusion is that for a large number of results, the proposed multiplier is faster than a dedicated one and comparable in area coverage for the same number of parallel products. It is also much more efficient in these respects if we consider that the MU covers the operation of five types of multipliers (i.e., 4 × 4, 8 × 8, 16 × 16, 32 × 32 and 64 × 64).

In [11], a multiplier that resembles the proposed Block 32 × 32 was presented, providing a good opportunity for comparison; the results are shown in Table 5. The only parameter in which our design is less efficient is the area coverage, because the architecture of [11] was specially designed and characterized on the Xilinx Virtex device.

The next experimental step was a comparison between a conventional carry-save adder and the proposed reconfigurable carry-save adder structure. Table 4 shows the comparison results for various bitwidths of input operands and various numbers of operands in terms of slices, equivalent gates and clock frequency. It can be seen that the conventional carry-save adder is better than the reconfigurable one in terms of area coverage and frequency of operation when considering a single function. For this comparison, we separated the uniform AU into the appropriate parts in order to achieve a fair comparison. However, taking into account the area coverage, the

conclusion is that if we want to perform all the functions of the AU, the proposed adder is much more efficient (5190 slices and 60,820 equivalent gates for the 1st implementation of the AU, and 3487 slices and 42,523 equivalent gates for the 2nd implementation of the AU). The corresponding figures for the four carry-save adders that perform the same operations as the AU are 6392 slices and 74,649 equivalent gates.

Concerning the frequency, a conventional carry-save adder is faster than the reconfigurable one, since the latter includes additional combinational logic for supporting the reconfigurable features.

Table 6 shows a qualitative comparison between four reconfigurable architectures. The common characteristic of the first three is that they support operands of variable wordlength. The first is the proposed one, while the second, third and fourth were introduced in [8], [11] and [10], respectively. The term "variable" means that the particular design parameter/feature can be dynamically reconfigured within a certain range by the reconfiguration logic, while "fixed" means that it cannot be reconfigured. It can be seen that the proposed architecture supports the reconfiguration of more design parameters/features and therefore offers greater flexibility (i.e., a plethora of alternative implementations) than the existing ones. This flexibility makes the proposed architecture an attractive solution for the implementation of DSP kernels and the design of larger IP cores. It should also be noted that only the proposed architecture features power-down modes of operation, depending on the application inputs.

Table 7 provides power consumption figures for a Xilinx xc2v3000 implementation of the proposed MAC that can be configured to handle either 16- or 32-bit operands. In the first case it supports up to 32 operands, while in the latter up to 8. It can be seen that the power consumption is in the order of 100 mW. It should be noted that the specific FPGA device has a large quiescent power consumption (93 mW) regardless of the logic implemented on it. Therefore, the actual power consumption of the implemented logic is less than 40 mW.

4. Conclusions

The design of a reconfigurable multiplication-accumulation unit was presented. It can be reconfigured in terms of bitwidth, arithmetic representation, throughput rate, and functionality. Its architecture comprises two principal units, the multiplication unit and the addition unit. These two structures can operate both independently, as a multiplier and an adder, and together as a MAC. The strength of the design stems from the use of sub-multipliers, repeatable parts and fully operation-independent units. This makes it possible to power down part of the architecture, reducing power consumption, and to use each component or the entire MAC in larger designs as reusable IP blocks, providing the designer with a number of alternative implementations. The superiority of the proposed architecture in terms of flexibility was demonstrated by qualitative and quantitative comparisons with similar existing reconfigurable architectures.
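To make the trade between bitwidth and array size concrete, the following Python sketch models the MAC at a purely behavioural level. It assumes that a 64 × 64 multiplier array is split into (64/b)² independent b × b sub-multipliers, which matches the operation counts in Tables 2 and 3 but is our reading rather than a description of the actual hardware. Only unsigned operands are modelled, and all names (`ARRAY_BITS`, `parallel_products`, `mac`) are hypothetical.

```python
ARRAY_BITS = 64  # assumed width of the full multiplier array

def parallel_products(bitwidth):
    """Number of b x b products available when the array is split into b-bit blocks."""
    if bitwidth <= 0 or ARRAY_BITS % bitwidth != 0:
        raise ValueError("bitwidth must be a positive divisor of the array width")
    return (ARRAY_BITS // bitwidth) ** 2

def mac(a, b, bitwidth, acc=0):
    """Multiply-accumulate over unsigned operand vectors of the given wordlength."""
    capacity = parallel_products(bitwidth)
    if len(a) != len(b) or len(a) > capacity:
        raise ValueError("operand vectors do not fit the configured array size")
    limit = 1 << bitwidth
    if any(not 0 <= x < limit for x in list(a) + list(b)):
        raise ValueError("operand does not fit the configured wordlength")
    # Sum of pairwise products, added to the running accumulator value.
    return acc + sum(x * y for x, y in zip(a, b))

if __name__ == "__main__":
    for bits in (4, 8, 16, 32, 64):
        print(bits, parallel_products(bits))
    print(mac([1, 2, 3], [4, 5, 6], bitwidth=8))
```

Halving the wordlength quadruples the number of products that fit in the same array, which is why the 4-bit configuration handles 512 operands while the 32-bit configuration handles 8.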

Table 5
Comparison between the proposed Block 32 × 32 and the new multiplier 32 × 9 of Corsonello et al.

| Design | Delay (ns) | Slices | # of 8 × 8 products | # of 16 × 16 products | # of 32 × 16 products | # of 32 × 32 products |
|---|---|---|---|---|---|---|
| Proposed block 32 × 32 | 21.251 | 1416 | 16 | 4 | 1 | 1 |
| New multiplier 32 × 9 [11] | 23 | 1108 | 4 | 2 | 1 | 1 |
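Table 5 can also be read in terms of cost per product. The sketch below divides the slice count by the number of 8 × 8 products each design delivers, and the number of products by the delay; treating the delay column as nanoseconds per evaluation is an assumption, and the dictionary keys are illustrative labels.

```python
# (delay, slices, number of 8 x 8 products), from Table 5
designs = {
    "proposed block 32x32": (21.251, 1416, 16),
    "multiplier of [11]": (23.0, 1108, 4),
}

# Area spent per 8 x 8 product delivered in one evaluation.
slices_per_product = {name: slices / n for name, (_, slices, n) in designs.items()}

# 8 x 8 products delivered per unit of delay (assumed to be nanoseconds).
products_per_delay = {name: n / delay for name, (delay, _, n) in designs.items()}

if __name__ == "__main__":
    for name in designs:
        print(f"{name}: {slices_per_product[name]:.1f} slices per product, "
              f"{products_per_delay[name]:.3f} products per ns")
```

On this measure the proposed block uses more slices in total but fewer slices per 8 × 8 product.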

Table 6
Qualitative comparison between different reconfigurable architectures

| Architecture | Word length | Functions | Arithmetic representation | # pipeline stages | # operands |
|---|---|---|---|---|---|
| Proposed | Variable | MAC, multiplication, addition, data format conversion | Unsigned, signed, 2's complement | Variable | Variable |
| [8] | Variable | Multiplication | Unsigned, 2's complement | Fixed | Fixed |
| [11] | Variable | Multiplication | Unsigned, signed | Fixed | Fixed |
| [10] | Fixed | Multiplication | Unsigned | Variable | Fixed |

Table 7
Power consumption of the proposed reconfigurable MAC

| Operand bitwidth | # operands | Max. clock frequency (MHz) | Total power (mW) | Device coverage (% of slices) |
|---|---|---|---|---|
| 32 | 8 | 12 (one pipeline stage) | 118.25 | 11% |
| 32 | 8 | 24 (one pipeline stage) | 146.49 | 11% |
| 16 | 32 | 12 (one pipeline stage) | 122.49 | 11% |
| 16 | 32 | 24 (one pipeline stage) | 153.98 | 11% |

References

[1] J. Hauser, J. Wawrzynek, Garp: a MIPS processor with a reconfigurable coprocessor, in: Proceedings of the IEEE FCCM '97, Napa, CA, USA, 1997, pp. 24–33.
[2] E. Mirsky, A. DeHon, MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources, in: Proceedings of the IEEE FCCM '96, Napa, CA, USA, 1996.
[3] T. Miyamori, K. Olukotun, REMARC: reconfigurable multimedia array coprocessor, in: Proceedings of the ACM/SIGDA FPGA '98, Monterey, 1998.
[4] H. Singh, et al., MorphoSys: an integrated re-configurable architecture, in: Proceedings of the NATO RTO Symp. on System Concepts and Integration, Monterey, CA, USA, 1998.
[5] J. Rabaey, Reconfigurable computing, the solution to low power programmable DSP, in: Proceedings of the ICASSP '97, Munich, Germany, 1997.
[6] S.C. Goldstein, et al., PipeRench, a reconfigurable architecture and compiler, IEEE Comput. 33 (4) (2000) 70–77.
[7] R. Hartenstein, A Decade of Reconfigurable Computing: A Visionary Retrospective, Embedded Tutorial, Asia-Pacific DAC, 2001.
[8] R. Lin, Reconfigurable parallel inner product processor architecture, IEEE Trans. VLSI Syst. 9 (2) (2001) 261–272.
[9] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, 2000.
[10] S. Kim, M. Papaefthymiou, Reconfigurable low energy multiplier for multi-media system design, in: Proceedings of the IEEE Annual Workshop on VLSI, 2000.
[11] P. Corsonello, et al., Variable precision multipliers for FPGA-based reconfigurable computing systems, in: Proceedings of the FPL 2003, Springer LNCS, 2003, pp. 661–669.

Konstantinos Tatas received his degree in Electrical and Computer Engineering from the Democritus University of Thrace, Greece in 1999. He is currently working towards his Ph.D. in the VLSI Design and Testing Center in the same University. He was employed as an RTL designer in INTRACOM SA, Greece between 2000 and 2003. His research interests include low-power VLSI design of DSP and multimedia systems, IP core design and design for reuse.

George Koutroumpezis received his degree in Electrical and Computer Engineering from the Democritus University of Thrace, Greece in 2002. He is currently working towards his M.S. in the VLSI Design and Testing Center in the same University. His research interests include reconfigurable VLSI design, IP core design and design for reuse.

Dimitrios Soudris received his Diploma in Electrical Engineering from the University of Patras, Greece, in 1987. He received the Ph.D. degree in Electrical Engineering from the University of Patras in 1992. He is currently working as Assistant Professor in the Department of Electrical and Computer Engineering, Democritus University of Thrace, Greece. His research interests include low power design, parallel architectures, embedded systems design, and VLSI signal processing. He has published more than 80 papers in international journals and conferences. He was leader and principal investigator in numerous research projects funded by the Greek Government and Industry as well as the European Commission (ESPRIT II-III-IV and 5th IST). He has served as General Chair and Program Chair for the International Workshop on Power and Timing Modelling, Optimisation, and Simulation (PATMOS). Recently, he received an award from INTEL and IBM for the project results of LPGD #25256 (ESPRIT IV). He is a member of the IEEE, the VLSI Systems and Applications Technical Committee of IEEE CAS and the ACM.

Adonios Thanailakis was born in Greece on August 5, 1940. He received B.Sc. degrees in physics and electrical engineering from the University of Thessaloniki, Greece, in 1964 and 1968, respectively, and the M.Sc. and Ph.D. degrees in electrical engineering and electronics from UMIST, Manchester, UK in 1968 and 1971, respectively. He has been a Professor of Microelectronics in the Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece, since 1977. He has been active in electronic device and VLSI system design research since 1968. His current research activities include microelectronic devices and VLSI systems design. He has published a great number of scientific and technical papers, as well as five textbooks. He has led research and development projects funded by Greece, the EU, or other organizations on various topics of Microelectronics and VLSI Systems Design (e.g., NATO, ESPRIT, ACTS, STRIDE).
