
A Reconfigurable Architecture of A High Performance 32-bit MAC Unit For Embedded DSP


Ying Li, Jie Chen
Microelectronic R&D Center, Chinese Academy of Sciences
Beijing, 100029, China

Abstract:
This paper describes a reconfigurable architecture of a high-performance pipelined 32-bit Multiply-Accumulate (MAC) unit, which is designed for a powerful embedded Digital Signal Processor (DSP). The MAC unit can carry out two 16-bit multiplications in one clock cycle. The 32x16, 32x32, 32x16+80 and 32x32+80 operations can be implemented in two clock cycles. These characteristics allow the DSP to be applied efficiently in different situations. A 2-stage pipeline is designed for this MAC unit to reach high throughput. This MAC is synthesizable and has already been used in an embedded DSP core.

I. Introduction
Nowadays, real-time multimedia systems for speech, video and image processing are in dire need of powerful DSPs. In speech codecs such as G.729b, GSM-AMR, MP3 and AC-3, 16-bit MAC operations are required. But in wireless/handheld devices and many other applications of DSP chips, 32-bit and even wider computation should be available. As the key building block of a DSP, the multiply-accumulate (MAC) unit must be designed with new methods to meet not only the high speed and low power requirements, but also the high efficiency needed for different applications. In this paper, an efficient reconfigurable architecture of a high performance 32-bit MAC unit is described. This MAC can carry out two 16-bit multiplications in one cycle. The 32x16, 32x32, 32x16+80 and 32x32+80 operations are also supported and can be implemented in two clock cycles.

Compared with traditional MAC units, this reconfigurable architecture is innovative in the following two aspects:
1. Two 16-bit multipliers with modified Booth arithmetic and Wallace Tree structure are combined to implement 32-bit multiplication. Thus 16-bit and 32-bit computations can both be handled by the MAC efficiently.
2. A 2-stage global pipeline is added in the MAC unit to reach a high data throughput. In the first global pipeline stage, a self-timed local pipeline is designed to accelerate the partial product generation of 32-bit multiplication and MAC operation, which can decrease the latency by 25%.

This paper is organized as follows: Section II gives the architecture specification and operation of the reconfigurable MAC unit. Section III describes the components of the MAC unit in detail, including the Booth-Wallace Tree multiplier and the accumulator. Section IV shows how the global and local pipelines work in this MAC, including the self-timed clock generation system. Section V presents the summary of this MAC unit.

II. Architecture
A reconfigurable architecture is designed for handling both 16-bit and 32-bit computation. Fig. 1 gives its complete architecture. The main building blocks are two 16-bit Booth-Wallace Tree multipliers and an accumulator, containing a 16-bit carry look-ahead adder (CLA) and a four-input adder with a Wallace Tree compressor, which uses carry select adders (CSA). The inputs of the MAC are the 32-bit multiplicand A, the 32-bit multiplier B and an 80-bit accumulator C. The results are the 40-bit MP1 and MP2 and an 80-bit R that comes from the MAC operation. MP1 and MP2 are the two multiply results when the MAC unit carries out two 16-bit multiplications at the same time. R is the result of the

0-7803-7889-X/03/$17.00 ©2003 IEEE
32x16, 32x32, 32x16+80 and 32x32+80 operations. The dashed line in the middle of Fig. 1 separates the two pipeline stages.

Figure 1. Architecture of the MAC unit

The architecture is based on the following equations, which show how two 16-bit multipliers implement a 32-bit multiplication. Let A and B be the multiplicand and the multiplier in binary, respectively, with bits a_i and b_i:

$$A = (a_{N-1} \ldots a_{N/2}\, a_{N/2-1} \ldots a_0)_2 = A_1 \cdot 2^{N/2} + A_0$$
$$B = (b_{N-1} \ldots b_{N/2}\, b_{N/2-1} \ldots b_0)_2 = B_1 \cdot 2^{N/2} + B_0 \tag{1}$$

Where:

$$A_1 = (a_{N-1} \ldots a_{N/2})_2, \quad A_0 = (a_{N/2-1} \ldots a_0)_2$$
$$B_1 = (b_{N-1} \ldots b_{N/2})_2, \quad B_0 = (b_{N/2-1} \ldots b_0)_2$$

Then the multiplication can be denoted as:

$$A \cdot B = A_1 B_1 \cdot 2^{N} + (A_1 B_0 + A_0 B_1) \cdot 2^{N/2} + A_0 B_0 \tag{2}$$

As equation 2 shows, an N x N operation can be achieved mainly by four N/2 x N/2 multiplications, together with additional shift and add operations. So we can use the two 16-bit multipliers twice to generate the partial products (PPs) needed. These partial products are shifted and added to obtain the final result. To balance the pipeline, the two rounds of 16-bit multiplications, A1*B1, A0*B0, A1*B0 and A0*B1, are implemented in the first pipeline stage. The final addition of the 4 products and the accumulator is completed in the second pipeline stage.

Because an n-stage pipeline can ideally speed up a circuit n times, a 2-stage local pipeline is used in the first global pipeline stage. We use 2-stage pipelined 16-bit multipliers and operate them in pipeline mode. This method can decrease the PP generation delay by 25% while adding only a small area for internal registers. The local pipeline control is described in detail in Section IV.

III. Components in the MAC unit

In this section, we specify the components of the MAC unit: the multipliers in the first global pipeline stage and the accumulator in the second one.

A. Booth-Wallace Tree Multiplier

The modified Booth algorithm and the Wallace Tree structure are widely used in modern fast multipliers. The main blocks are the Booth encoder, the partial product generator, the Wallace Tree compressor and the final adder.

The modified Booth algorithm scans the 2's complement multiplicand according to equation 3:

$$A = -a_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} a_i \cdot 2^{i} \tag{3}$$

where $a_{-1} = 0$.

According to equation 3, the Booth algorithm divides the multiplicand into groups of 3 bits (with one bit overlapped) and encodes them as Table 1 shows. The five outputs {0, B, -B, 2B, -2B} denote five different multiples of the multiplier B. So the number of PPs of the 16x16 unsigned/signed multiplication can be reduced by about half.

The Wallace Tree is a reduction architecture that uses the carry save adder technique to handle the addition of the PPs. A carry save adder accepts three inputs with the same weight, $x_i$, $y_i$ and $z_i$, and generates two outputs: $s_i$, with the same weight, and $c_{i+1}$, with double the

weight. A row of these adders forms a 3-2 compressor, so the addition time can be reduced considerably. For an n-bit multiplier, the number of carry save adder stages is on the order of log n.

In the 16-bit unsigned/signed multiplier, 9 partial products are sent into a 9-to-2 Wallace Tree compressor, and its 2 outputs are fed to the final adder to produce the final product.

Table 1. Modified Booth Encoding Table

a_{i+1} | a_i | a_{i-1} | Operation on B
0 | 0 | 0 | 0
0 | 0 | 1 | +B
0 | 1 | 0 | +B
0 | 1 | 1 | +2B
1 | 0 | 0 | -2B
1 | 0 | 1 | -B
1 | 1 | 0 | -B
1 | 1 | 1 | 0

B. Accumulator

In the second pipeline stage, the addition of the PPs and the accumulator C is completed for the 32-bit MAC operation. As Fig. 1 shows, this stage consists of a Wallace Tree compressor with a carry select final adder and a 16-bit carry look-ahead adder. The inputs of the compressor are: the PP of A1*B1 (32 bits); the high 16 significant bits of the PP of A0*B0; the part of the accumulator C from bit 80 down to bit 17; the PPs of A1*B0 and A0*B1; and the extension of the sign bits of the PPs of A1*B0 and A0*B1, which occupies only one row according to the MAC operation mode (with the operands being unsigned/unsigned, unsigned/signed, signed/signed and signed/unsigned respectively). The low 16 significant bits of the PP of A0*B0 and those of C are sent to the 16-bit CLA adder as two addends. As in Fig. 2, the carry input is the carry out of the 16-bit CLA adder, whose sum forms the low 16 significant bits of the final result.

Figure 2. The detailed diagram of the Wallace Tree compressor and 64-bit CSA. FA is the abbreviation of full adder and HA is that of half adder.

The Wallace Tree compressor and the 64-bit CSA adder are illustrated in Fig. 2. As the figure shows, the carry save adder technique is used once again in the compressor. The output of the 64-bit CSA, combined with that from the 16-bit CLA, forms the final result of the 32-bit MAC operation.

IV. Pipeline

In order to generate the 4 PPs needed by a 32x32 operation in one cycle, a local pipeline and a sub-region clock generator are designed in this MAC unit. The 2-stage pipelined multiplier used in this MAC unit saves 25% of the 16-bit multiplication delay compared with a non-pipelined multiplier working in 32-bit computation. Fig. 3 explains the 32x32 multiplication flow; the pipelined 32x32+80 operation can also be read from this figure.

As in Fig. 3, the zone between the upper two thick dashed lines represents the first global pipeline stage. The two thin dashed lines in this zone divide this global stage into 3 local pipeline stages, which are controlled by the self-timed clock from the sub-region clock generator.
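The accumulation scheme of Section III.B can be illustrated with a small software sketch. It is only a model under stated assumptions, not the authors' circuit: operands are treated as unsigned (so the sign-extension row is omitted), Python integers stand in for the hardware adders, and the names `carry_save_add` and `mac32_accumulate` are hypothetical. The low 16 bits are produced by a separate 16-bit addition whose carry out feeds the upper 64 bits, where four rows are compressed to two by carry save adders before one final carry-propagate addition.

```python
MASK16 = (1 << 16) - 1
MASK64 = (1 << 64) - 1
MASK80 = (1 << 80) - 1


def carry_save_add(x, y, z):
    """One 3-2 compressor row: three operands in, a sum row and a carry row out."""
    s = x ^ y ^ z                             # bitwise sum, same weight
    c = ((x & y) | (x & z) | (y & z)) << 1    # carries, double weight
    return s & MASK64, c & MASK64


def mac32_accumulate(a, b, c):
    """Return (a * b + c) mod 2**80 for unsigned 32-bit a, b and 80-bit c."""
    a1, a0 = a >> 16, a & MASK16
    b1, b0 = b >> 16, b & MASK16
    # The four 16x16 partial products generated in the first pipeline stage.
    a1b1, a0b0, a1b0, a0b1 = a1 * b1, a0 * b0, a1 * b0, a0 * b1

    # Low 16 bits: 16-bit addition (the CLA in the paper) of A0*B0 and C.
    low = (a0b0 & MASK16) + (c & MASK16)
    low_sum, carry_out = low & MASK16, low >> 16

    # Upper 64 bits: four rows, all expressed relative to bit 16.
    row_a = ((a1b1 << 16) | (a0b0 >> 16)) & MASK64  # A1*B1 joined with high half of A0*B0
    row_c = c >> 16                                 # upper 64 bits of C
    s, cy = carry_save_add(row_a, row_c, a1b0)
    s, cy = carry_save_add(s, cy, a0b1)
    upper = (s + cy + carry_out) & MASK64           # final carry-propagate adder

    return (upper << 16) | low_sum
```

The point of the split is that the carry-propagate addition is paid only once, after the rows have been compressed; the sketch can be checked against plain integer arithmetic, for example `mac32_accumulate(a, b, c) == (a * b + c) % 2**80`.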

Figure 3. The pipeline of 32x32 and 32x32+80 operations

The sub-region clock generator is described in Fig. 4. The sub-region clock is generated by dummy cells which are connected like a ring oscillator. The delay of the dummy cell, closely matched to half of the multiplier delay and the internal register delay, decides the period of the local clock. It also decides the global clock frequency. Because the dummy cell of the self-timed sub-region clock generator and the multipliers are identical cells on the same chip, their speeds change together with process variation.

Figure 4. Sub-region clock generator for local pipeline control.

The sub-region clock of the proposed pipeline stage works as follows. It remains low while the multiplier is idle. The global clock acts as a start signal: when it is issued, the internal oscillator starts and the local clock is generated after the dummy cell delay. Internal registers latch the partial products at the positive edge of the clock. A shift register works as the pipeline controller; it counts the clock periods and stops the clock after the final products are carried out. Synthesized with the SMIC 0.18 um process library and after placement and routing, this self-timed sub-region clock generator works stably and reaches about 660 MHz.

V. Summary
To meet the requirement of high throughput and high efficiency of a powerful SIMD DSP, a high performance and multi-function 32-bit multiply-accumulate unit has been designed. It can implement two 16x16 operations simultaneously, or one 32x16, one 32x32, one 32x16+80 or one 32x32+80 operation, on unsigned/signed operands. In the multiplier, Booth encoding and Wallace Tree partial product compression have been used, and a 2-stage local pipeline and a sub-region clock generator are embedded in the global pipeline. This MAC unit has been synthesized and simulated to confirm the correctness of this architecture. After synthesis, placement and routing, this reconfigurable MAC can work stably at 220 MHz.
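The operating modes of the MAC can be illustrated with a small behavioural model. This is a minimal sketch under assumptions, not the authors' implementation: the function names are hypothetical, the operands are taken as signed two's complement, and the assignment of the high halves to MP1 and the low halves to MP2 is assumed rather than taken from the paper. The 32-bit mode follows the decomposition of equation 2 with N = 32, where the high halves are signed and the low halves unsigned, which is one way to see why the 16-bit multipliers must accept both unsigned and signed operands.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def dual_mul16(a, b):
    """Dual 16-bit mode: two independent signed 16x16 products (MP1, MP2)."""
    mp1 = to_signed(a >> 16, 16) * to_signed(b >> 16, 16)  # high halves (assumed MP1)
    mp2 = to_signed(a, 16) * to_signed(b, 16)              # low halves (assumed MP2)
    return mp1, mp2


def mac32(a, b, c=0):
    """32x32+80 mode built from four 16x16 products (equation 2, N = 32)."""
    a1, a0 = to_signed(a >> 16, 16), a & 0xFFFF  # high half signed, low half unsigned
    b1, b0 = to_signed(b >> 16, 16), b & 0xFFFF
    product = ((a1 * b1) << 32) + ((a1 * b0 + a0 * b1) << 16) + a0 * b0
    return to_signed(product + c, 80)            # wrap to the 80-bit accumulator
```

For example, `mac32(-5, 7, 100)` evaluates the four partial products of -5 and 7 and adds 100, while `dual_mul16(0x0003FFFE, 0x00040005)` multiplies 3 by 4 and -2 by 5 independently.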
