You are on page 1of 4

A High-Speed/Low-Power Multiplier Using an

Advanced Spurious Power Suppression Technique


Kuan-Hung Chen, Yuan-Sun Chu, and Yu-Min Chen Jiun-In Guo
Department of Electrical Engineering, Department of Computer Science and Information Engineering,
National Chung Cheng University, National Chung Cheng University,
Chia-Yi 621, Taiwan, R.O.C. Chia-Yi 621, Taiwan, R.O.C.
ckh@vlsi.ee.ccu.edu.tw; chu@ee.ccu.edu.tw; jiguo@cs.ccu.edu.tw
93griffeypop@vlsi.ee.ccu.edu.tw
Abstract—This study provides the experience of applying an Design [6] is a vector MAC unit that can perform one 64 × 64,
advanced version of our former Spurious Power Suppression two 32 × 32, four 16 × 16, or eight 8 × 8 signed/unsigned
Technique (SPST) on multipliers for high-speed and low-power multiply-accumulations. The design [7] involves a
purposes. To filter out the useless switching power, there are
two approaches, i.e. using registers and using AND gates, to fixed-width 32-bit left-to-right multiplier which obtains 8%
assert the data signals of multipliers after the data transition. speed improvement, 14% power reduction, and 13% area
The simulation results show that the SPST implementation with saving. Meanwhile, design [8] explores a design methodology
AND gates owns an extremely high flexibility on adjusting the for high-speed modified Booth multipliers. Finally, design [9]
data asserting time which not only facilitates the robustness of adopts an advanced 90-nm dual-Vt CMOS technology to
SPST but also leads to a 40% speed improvement. By adopting
implement a 16 × 16 bit multiplier that consumes 9mW at
a 0.18-µm CMOS technology, the proposed SPST-equipped
multiplier dissipates only 0.0121 mW per MHz in H.264 texture 1GHz with 1.3V. Our former SPST is demonstrated to reduce
coding applications, and obtains a 40% power reduction. 20% to 30% power dissipation of H.264 transform coding and
some multimedia computations [10]-[11]. The SPST uses a
I. INTRODUCTION detection logic circuit to detect the effective data range of
Lowering down the power consumption and enhancing the arithmetic units, e.g. adders, or multipliers. When a portion of
processing performance of the circuit designs are undoubtedly data does not affect the final computing results, the SPST uses
the two important design challenges of wireless multimedia data controlling circuits to latch this portion to avoid useless
and DSP applications, in which multiplications are frequently data transitions occurring inside the arithmetic units. Besides,
used for key computations, such as FFT, DCT, quantization, there is a data asserting control realized by using registers to
and filtering. This study examines some modern multiplier or further filter out the useless spurious signals of the arithmetic
MAC designs [1]-[9]. Design [1] presents a multiplier using units every time when the latched portion is turned on.
the DRD unit to select the input operand with a smaller Although this asserting control brings evident power
effective dynamic range to yield the Booth codes. The direct reduction, it may induce additional delay which is the
report of [1] shows that the multiplier can save over 30% problem this study wants to solve. When implemented in a
power dissipation than conventional ones. The design [2] tries 0.18-µm CMOS technology, the proposed SPST-equipped
to balance the signal of the partial product reducing (PPR) multiplier dissipates only 0.0121 mW per MHz in H.264
stage, and shorten the path of the PPR stage to prevent the texture coding applications, and obtains a 40% power
snow-balling glitch effect of array multipliers. The design [2] reduction than the conventional multiplier.
can save about 20% power dissipation when compared with The remainder of this study is arranged as follows:
conventional right-to-left multipliers. The design [3] turns off Section II reviews the former SPST and presents the
some columns in the multiplier array whenever their outputs improvement of this work. Section III illustrates the multiplier
are known, thus saving 10% power consumption at the cost of adopting the proposed SPST. Section IV addresses the
20% area overhead for random input. The design [4] uses a implementation and performance comparison of this work. At
DRD unit to detect the dynamic range of the inputs, and last, a conclusion is given in Section V.
adopts three separate Wallace trees for the 4 × 4, 8 × 8, and II. PROPOSED SPURIOUS POWER SUPPRESSION
16 × 16 multiplications which certainly induces area and TECHNIQUE (SPST)
capacitance overheads. Under a 0.13-µm CMOS technology,
the design [4] can obtain a 20% power reduction over the The former SPST has been discussed in [10]-[11]. This paper
conventional multiplier at a cost of 44% in area overheads. explores two implementing approaches for the SPST and
Design [5] proposes a 32-bit SIMD MAC unit which is a then compares their efficiency to provide diverse reference
coprocessor to the Intel® XScaleTM microprocessor. Under a materials for applying the SPST. For completeness of the
article and easy understanding for readers, we review the
0.18-µm CMOS technology, [5] presents that the processor
former SPST first. In Fig. 1, the SPST is illustrated through a
dissipates 450 mW at 600MHz with 1.3V supply voltage.
The authors would like to thank the National Science Council of the
Republic of China, Taiwan for financially supporting this research under
Contract No. NSC 94-2220-E-194-012.

1-4244-0921-7/07 $25.00 © 2007 IEEE.


3139
low-power adder/subtractor design example. The slightly different from the one in the first implementation.
adder/subtractor is divided into two parts, i.e. the Most That is, the range of Φ can be set as 0 < Φ <∆ to filter out the
Significant Part (MSP) and the Least Significant Part (LSP). glitch signals and to keep the computation results correct, as
The MSP of the original adder/subtractor is modified to well. This feature allows upper-level systems to assert the
include detection logic circuits, data controlling circuits, sign close signal with an arbitrarily short delay closing to the
extension circuits, and some glue logics for calculating the positive edge of the clock signal, which provides a more
carry in and carry out signals. The most important part of this flexible controlling space for the delay Φ. When speed is
study is the design of the control signal asserting circuits, seriously concerned, this implementing approach enables an
denoted as asserting circuits in Fig. 1, following the detection extremely high flexibility on adjusting the data asserting time
logic circuits. The former implementing approach of the of the SPST-equipped multipliers. Therefore, the proposed
control signal assertion circuits is using registers, which is advanced SPST can benefit multipliers on both high-speed
illustrated in the shadow area in Fig. 2. The three output and low-power features.
signals of the detection logic are given a certain amount of Ψ
delay before they assert, demonstrated in the timing diagram
shown in Fig. 3. The delay Φ, used to assert the three output
signals, must be set in the range of Ψ < Φ <∆ to filter out the
glitch signals as well as to keep the computation results
correct, where Ψ and ∆ respectively denote the data transient
period and the earliest required time of all the inputs. The
range of Φ is also shown as the shadow area in Fig. 3. Using
registers to control the signal assertions can obviously reduce
the spurious power dissipation of adders/subtractors
[10]-[11]. However, the restriction that Φ must be greater Fig. 3. The timing diagram of control signals of detection logic circuits after
than Ψ to guarantee the registers from latching the wrong assertions.
values of control signals usually decreases the overall speed
of the applied designs. This issue should be noticed in
high-end applications which demands both high-speed and
low-power requirements.

Fig. 4. The detection logic circuits (at the left-hand side of the dotted line)
using an AND gate to assert the control signals.
2AC9 0010101011001001
006A x) 00000000011010100

Fig. 1. A low-power adder/subtractor design example adopting the SPST +0 +0 +0 +0 +2 -1 -1 -2


11010101001101110 PP0
11101010100110111 PP1
11101010100110111 PP2
00101010110010010 PP3
00000000000000000 PP4
00000000000000000 PP5
00000000000000000 PP6
+)00000000000000000 PP7
00000000000100011011011100111010 (11B73A)
Fig. 5. An illustration of multiplication using modified Booth encoding

III. LOW-POWER MULTIPLIER DESIGN


We equip the SPST on a 16*16-bit row-based Booth
multiplier. The details are described in the follow:
A. Applying the SPST on the modified Booth encoder
Fig. 2. The detection logic circuits (at the left-hand side of the dotted line)
using registers to assert the control signals. Fig. 5 shows a computing example of Booth multiplying two
To solve the problem mentioned above, an AND gate is numbers “2AC9” and “006A”, where the shadow denotes
proposed to replace the registers to control the signal that the numbers in this part of Booth multiplication are all
assertion, as illustrated in the shadow area in Fig. 4. The zero so that this part of the computations can be neglected.
timing control of the delay Φ in this implementation is Saving those computations can significantly reduce the

3140
power consumption caused by the transient signals. dissipates only 0.0121 mW per MHz when computing the
According to the analysis of the multiplication shown in Fig. multiplication of texture coding in H.264.
5, we propose the SPST-equipped modified-Booth encoder,
which is controlled by a detection unit. The detection unit has
one of the two operands as its input to decide whether the
Booth encoder calculates redundant computations. As shown
in Fig. 6, the latches can respectively freeze the inputs of
MUX-4 to MUX-7 or only those of MUX-6 to MUX-7 when
the PP4 to PP7 or only the PP6 to PP7 are zero, to reduce the
transition power dissipation. Such cases occur frequently in
e.g. FFT/IFFT, DCT/IDCT and Q/IQ adopted in encoding or
decoding multimedia data.

Fig. 7. The proposed low-power SPST-equipped multiplier, where the


Fig. 6. The SPST-equipped modified-Booth encoder
fraction values (M/L) denote the bit-widths of the MSP and LSP of the
SPST-equipped adders
B. Applying the SPST on the compression tree
TABLE 1
The proposed SPST-equipped multiplier is illustrated in Fig. SIMULATION RESULTS OF THREE MULTIPLIERS, WHERE THE POWER
7. The PP generator generates five candidates of the partial DISSIPATION IS MEASURED USING NANOSIM WHEN COMPUTING TEXTURE
products, i.e. {-2A, -A, 0, A, 2A}, which are then selected CODING OF “STEFAN” SEQUENCE IN H.264
Design Power/MHz Norm. P Area (Tr.) Max. Freq.
according to the Booth encoding results of the operand B.
Original tree MUL 0.0201 1 10544 200 MHz
Moreover, when the operand besides the Booth encoded one SPST using REG 0.0118 0.59 10910 142 MHz
has a small absolute value, there are opportunities to reduce SPST using AND gate 0.0121 0.60 11028 200 MHz
the spurious power dissipated in the compression tree.
According to the redundancy analysis of the additions [11],
we replace some of the adders in compression tree of the
multiplier with the SPST-equipped adders, which are marked
with oblique lines in Fig. 7. The bit-widths of the MSP and
LSP of each SPST-equipped adder are also indicated in
fraction values (M/L) nearing the corresponding adder in Fig. 7.
IV. IMPLEMENTATION, PERFORMANCE EVALUATION, AND
COMPARISON
The SPST-equipped multiplier design has been realized by
following the standard cell-based design flow with an in-house
0.18-µm CMOS cell library which is constructed following the
TSMC 1P6M 0.18-µm CMOS technology. This design is Fig. 8. Comparison of power dissipation of the tree multiplier and the
SPST-equipped multipliers adopting the two implementing approaches, i.e.
verified via C/Matlab behavioral simulation, nLint HDL using registers and using AND gates, with Gaussian distributed inputs.
coding rule check, VERILOG RTL simulation, SYNOPSYS
logic synthesis, and NanoSim transistor-level simulation. Furthermore, we measure the power dissipation of the
When computing the multiplications of texture coding above three multipliers in terms of Gaussian distributed data
of H.264, the simulation results of the original tree multiplier patterns with different bit ranges, i.e. 4-bit, 8-bit, 12-bit, and
and the two SPST-equipped multipliers with different 16-bit. The results are illustrated in Fig. 8, where the X and
implementing approaches are listed in Table 1. The table Y-axis of the plots denote the bit-widths of the two operands
illustrates that both the SPST-equipped multipliers using of multipliers, respectively, and the Z-axis denotes the power
registers and AND gates save about 40% power dissipation of reduction compared with the original tree multiplier. In Fig.
the original tree multiplier. However, the maximum speed of 8, both SPST-equipped designs have the largest power saving
the SPST-equipped multipliers using AND gates is 40% when both the two input data are 4-bit wide. The power
higher (200/142−1) than that using registers. Besides, Table saving decreases as the data bit-widths increase. At last, the
1 also shows that the proposed SPST-equipped multiplier

3141
power saving becomes negative under some conditions with and the switching activities are expected to be the same.
large input data bit-widths. From Fig. 8, we can also know Furthermore, we plot the scaled results of the 16 × 16-bit
that the power saving of the SPST-equipped multiplier using multipliers listed in Table 2 in terms of power per unit
registers is slightly larger than that of the SPST-equipped frequency in Fig. 9. From this figure, we can also find that the
multiplier using AND gates. Nevertheless, using registers proposed multiplier has better power efficiency than the
results in longer critical path delay as shown in Table 1. existing multipliers. Designs [2], [5], and [7] are 32-bit
multipliers which are considered as reference indexes only.
TABLE 2
THE PERFORMANCE COMPARISON WITH THE EXISTING MULTIPLIERS V. CONCLUSION
Delay This study proposes a multiplier adopting the new SPST
Design Feature Tech. Power (mW) Area
(ns)
implementing approach. The simulation results show that the
(1) 19.65 for JPEG
Huang
(1) 32b × 32b 0.18 µm (2) 40.65@ 100MHz for 7.25
74598 power reduction of the new approach, i.e. a 40% saving, is
[2] (tr.) very close to that of the former approach. Besides, the new
random data
(1)Coprocessor approach leads to a 40% speed improvement when compared
Liao [5] (2) SIMD 0.18 µm 900@1.6V, 800MHz 1.25 N.A. with the former one. When implemented in a 0.18-µm
(3) 32b MAC
(1) 32b × 32b 0.35 µm/ 19743
CMOS technology, the proposed SPST-equipped multiplier
Wang [7] 79.86 14.01 dissipates only 0.0121 mW per MHz in H.264 texture coding.
(2)Fixed-width 3.3V (gate)
17.30 for normal 0.337 In addition, the performance of the proposed design under the
Chen [1] (1) 16b × 16b 0.25 µm 8.3
distribution inputs (mm2) conditions of different bit-width input data is explored. The
Scalable length 0.13 1.04@100MHz for random 6388 results also show that the new SPST approach not only owns
Lee [4] N.A.
of 4b, 8b, 16b µm/1.2V data (gate)
equivalent low-power performance but also leads to a higher
(1) 16b × 16b (1) 9@1.3V, 1GHz
Hsu [9] (2) Sleep mode 90 nm (1) 7.9 × 10−2 @50MHz, 1
0.03 maximum speed when compared with the former SPST
(mm2) approach. Moreover, the proposed SPST-equipped multiplier
(3) Duel VT 0.57V
(1) 1.21@100MHz, 1.8V also has better power efficiency when compared with the
(1) 16b × 16b
for H.264 texture coding
11028 existing modern multipliers.
Proposed 0.18 µm (2) 2.82@100MHz, 1.8V 5
(2) SPST (tr.)
for avg. normal REFERENCES
distribution inputs
[1] O. Chen, S. Wang, and Y. W. Wu, “Minimization of switching activities
of partial products for designing low-power multipliers,” IEEE Trans.
0.08 VLSI, vol. 11, no. 3, pp. 418-433, June 2003.
0.07 [2] Z. Huang, and M.D. Ercegovac, ”High-performance low-power
0.06 left-to-right array multiplier design,” IEEE Trans. on Computers, vol.
54, no. 3, pp. 272-283, Mar. 2005.
Scaled P/f

0.05
[3] M. C. Wen, S. J. Wang; Y. N. Lin, “Low-power parallel multiplier with
0.04 column bypassing,” Electronic Letters, vol. 41, no.12, pp. 581-583,
0.03 May 2005.
0.02 [4] H. Lee, “A power-aware scalable pipelined Booth multiplier,” IEEE Int.
SOC Conf., pp.123-126, Sep. 2004.
0.01
[5] Y. Liao, and D.B. Roberts, “A high-performance and low-power 32-bit
0 multiply-accumulate unit with single-instruction- multiple-data
[1] [4] [9] Proposed
(SIMD) feature,” IEEE J. of Solid-State Circuits, vol. 37, no. 7, pp.
926-931, July 2002.
Fig. 9. The comparison of power efficiency [6] A. Danysh, and D. Tan, “Architecture and implementation of a
vector/SIMD multiply-accumulate unit,” IEEE Trans. on Computers,
The performance comparison of the proposed multiplier vol. 54, no. 3, pp. 284-293, Mar. 2005.
with some existing designs, which report more complete [7] J. S. Wang, C. N. Kuo, and T. H. Yang, “Low-power fixed-width array
information, is listed in Table 2. To fasten the comparison multipliers,” IEEE Symp. Low Power Electronics and Design, pp.
307-312, 9-11, Aug. 2004.
fairly, we scale the power results to the same technology, i.e. [8] W. C Yeh, C. W. Jen, “High-speed Booth encoded parallel multiplier
0.18 µm technology with 1.8V, according to the geometric design,” IEEE Trans. Computer, vol. 49, no. 7, pp.692-701, July 2000.
feature and the well-known power approximation [9] S. K. Hsu, S. K. Mathew, M. A. Anders, B. R. Zeydel, V. G. Oklobdzija,
R. K. Krishnamurthy, and S. Y. Borkar, “A 110 GOPS/W 16-bit
P = α CV 2 f
, where P, α i , C, V, and f denote power
∑ i multiplier and reconfigurable PLA loop in 90-nm CMOS,” IEEE J.
i Solid-State Circuits, vol. 41, no. 1, pp. 256-264, Jan. 2006.
dissipation, switching activity, capacitance, voltage, and [10] K. H. Chen, K. C. Chao, J. I. Guo, J. S. Wang, and Y. S. Chu, "An
frequency, respectively. The scaling formulation used Efficient Spurious Power Suppression Technique (SPST) and its
Applications on MPEG-4 AVC/H.264 Transform Coding Design,"
is Porg (extended from [12]) IEEE Int. Symp. Low Power Electronics Design, San Diego, California,
Psca =
(Techorg / 0.18) 2 * (Vorg / 1.8) 2 Aug. 8-10 2005.
[11] K. H. Chen, Y. M. Chen, and Y. S. Chu, “A Versatile Multimedia
where Psca , Porg , Techorg , and Vorg denote the scaled power Functional Unit Design Using the Spurious Power Suppression
Technique,” IEEE Asian Solid-State Circuits Conf., Hangzhou, China,
dissipation, original power dissipation, line width in the Nov. 2006.
original technology, and original voltage, respectively. The [12] Y. W. Lin, H. Y. Liu, and C. Y. Lee, “A dynamic scaling FFT processor
capacitance is assumed to be proportional to the silicon area, for DVB-T applications,” IEEE J. Solid-State Circuits, vol. 39, no. 11,
pp. 2005-2013, Nov. 2004.

3142

You might also like