Abstract—This paper presents a new approach that uses an improved hybrid LUT-based architecture for low-error and efficient fixed-width squarer circuits. By employing both LUT-based and simple conventional logic circuits, a good trade-off between hardware complexity and performance can be achieved. Moreover, a mathematical identity of the squaring operation is exploited so that the error can be reduced significantly compared with other methods. The proposed method can also improve the speed and reduce the area of the squarer circuit. The implementation and chip measurement results in 0.18-µm CMOS technology are also presented and discussed.

I. INTRODUCTION

High-performance multiplier and squarer circuits are extensively used and play an essential role in modern and future digital signal processing (DSP) systems due to the increasing demand for high-speed applications. In particular, squaring is a fundamental arithmetic operation in many applications such as Viterbi decoding, digital filtering, vector quantization, pattern recognition and image compression. However, direct implementation of general parallel squarer circuits often requires high hardware complexity and may lead to high power consumption. Moreover, the rapid growth of mobile and portable devices leads to an emerging requirement for low-power, low-complexity digital circuits [1]. Therefore, much recent research has sought efficient architectures for low-area, low-power and high-performance squarer circuits. Some proposed methods, such as the folded technique [2], the merged technique [3] and the method of K.-J. Cho et al. [4], employ mathematical identities to reduce the hardware complexity and improve the performance of the full-width squaring circuit. A. G. M. Strollo et al. [5] proposed an improved method that combines Booth recoding with the folded technique and some specific sub-circuits.

On the other hand, in many DSP applications the squaring result is not required to be full-width but only the same bit-width as the operand. In that case, the architecture of a high-performance, low-complexity fixed-width squarer becomes an important topic, with issues such as error compensation, hardware efficiency and squarer performance to be considered. The authors of [6] and [7] have proposed methods to improve the accuracy and performance of truncated and fixed-width squarer circuits, but more improvements are desired, especially for future DSP systems, which require high-performance, low-power circuits to meet the high demand for portability and mobility.

Moreover, lookup table (LUT)-based computation has great potential to be employed in future DSP and communication systems because it mainly relies on memory access operations, resulting in high-speed computation and low power consumption [8]. The disadvantage that limits the popularity of LUT-based architectures in current systems is the exponential growth of LUT size as the operand width increases. Therefore, it is reasonable to look for a hybrid architecture that takes advantage of both LUT-based and conventional logic circuits so that a good trade-off between squarer performance and hardware complexity can be achieved. In this paper, we present a new approach for designing fixed-width squarer circuits for specific DSP applications by employing an improved hybrid LUT-based architecture to reduce the error, delay and hardware complexity. The main contribution of our paper is an improved architecture for a low-error, area-delay-efficient fixed-width squarer, obtained by employing a specific mathematical identity of the squaring operation in a new approach with several improvements.

The rest of this paper is organized as follows. Section II briefly introduces the fixed-width squarer, and Section III presents the proposed hybrid LUT-based architecture for the fixed-width squarer. Section IV shows the error analysis results, and Section V summarizes the implementation and chip measurement results of the proposed squarer architecture using 0.18-µm CMOS technology. Finally, the conclusion is given in Section VI.

II. FIXED-WIDTH SQUARER

In this paper, we consider the square of a W-bit binary number X. The full-width (2W-bit) square $X^2$ can be written as:

$$S = X^2 = \left( \sum_{i=0}^{W-1} x_i 2^i \right)^2 = \sum_{i=0}^{W-1} \sum_{j=0}^{W-1} x_i x_j 2^{i+j} \tag{1}$$

where $x_i$ and $x_j$ denote the binary bits of X. Fig. 1 shows the original partial product matrix (PPM) for a general full-width 8-bit squarer, in which each partial product bit $p_{ij}$ is computed by an AND logic operation as $p_{ij} = x_i x_j$.

In a full-width squarer, all partial product bits in the PPM are summed by using a compression tree and an adder [2].
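As a quick sanity check of (1), the double sum over partial products can be evaluated directly in software. The following Python sketch is illustrative only: the function names `square_via_ppm` and `check_all_inputs` are hypothetical and do not come from the paper, and the loops model the arithmetic of the PPM rather than any compression tree or adder hardware.

```python
def square_via_ppm(x: int, width: int) -> int:
    """Evaluate Eq. (1): sum of p_ij = x_i AND x_j, each weighted by 2^(i+j)."""
    # Extract the binary bits x_0 .. x_{W-1} of the operand.
    bits = [(x >> i) & 1 for i in range(width)]
    total = 0
    for i in range(width):
        for j in range(width):
            partial_product = bits[i] & bits[j]   # p_ij = x_i x_j
            total += partial_product << (i + j)   # weight 2^(i+j)
    return total


def check_all_inputs(width: int = 8) -> bool:
    """Compare the PPM sum with the ordinary square for every W-bit input."""
    return all(square_via_ppm(x, width) == x * x for x in range(1 << width))


if __name__ == "__main__":
    print(square_via_ppm(13, 8))
    print(check_all_inputs(8))
```

The exhaustive check is cheap for W = 8 (256 inputs); for larger widths a random sample of inputs would serve the same purpose.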
[Figure: block diagram with input Y, a Coarse LUT, a Fine LUT and an Adder producing $Y^2$; width labels A, A+B and W-3.]
TABLE I
FOUR-FOLDING METHOD FOR HYBRID LUT-BASED FIXED-WIDTH SQUARER.

| MSB2 | Y | DH | DL |
|---|---|---|---|
| 00 | $x_{W-3} \ldots x_1 x_0$ | $000 \ldots 00$ (DH0) | $00 \ldots 00$ |
| 01 | $x_{W-3} \ldots x_1 x_0$ | $00\, x_{W-3} \ldots x_1 x_0$ (DH1) | $0\, x_{W-3} \ldots x_0\, 1$ |
| 10 | $x_{W-3} \ldots x_1 x_0$ | $01\, x_{W-3} \ldots x_1 x_0$ (DH2) | $00 \ldots 00$ |
| 11 | $x_{W-3} \ldots x_1 x_0$ | $1\, x_{W-3} \ldots x_1 x_0\, 0$ (DH3) | $0\, x_{W-3} \ldots x_0\, 1$ |
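The role of the two most significant bits in Table I can be illustrated with a generic split-operand identity. Writing $X = M \cdot 2^{W-2} + Y$, where M is the value of the two MSBs and Y holds the remaining W-2 bits, gives $X^2 = M^2 \cdot 2^{2W-4} + M \cdot Y \cdot 2^{W-1} + Y^2$, so only $Y^2$ needs a table, with $2^{W-2}$ entries instead of $2^W$. The Python sketch below assumes this textbook identity; it is not claimed to reproduce the DH and DL encodings of Table I, and the names `build_y_squared_lut` and `split_square` are hypothetical.

```python
def build_y_squared_lut(width: int) -> list:
    """Table of Y^2 for every (W-2)-bit value Y (hypothetical stand-in for a hardware LUT)."""
    return [y * y for y in range(1 << (width - 2))]


def split_square(x: int, width: int, lut: list) -> int:
    """Square a W-bit operand from its two MSBs (M) and the LUT entry for Y."""
    k = width - 2
    m = x >> k                  # value of the two most significant bits, 0..3
    y = x & ((1 << k) - 1)      # remaining W-2 low-order bits
    # X^2 = M^2 * 2^(2k) + M * Y * 2^(k+1) + Y^2
    return ((m * m) << (2 * k)) + ((m * y) << (k + 1)) + lut[y]


if __name__ == "__main__":
    table = build_y_squared_lut(8)
    print(len(table))
    print(split_square(200, 8, table) == 200 * 200)
```

Under this assumption the table shrinks by a factor of four in the number of entries, at the cost of a small amount of shift-and-add logic selected by the two MSBs.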
TABLE II
ERROR ANALYSIS RESULTS FOR DIFFERENT FIXED-WIDTH SQUARER ARCHITECTURES.

| W | Method | ME (LSB) | MaxE (LSB) | MSE (LSB²) |
|---|---|---|---|---|
| 8 | Walters et al. [9] | 0.166 | 1.215 | 0.207 |
| 8 | Garofalo et al. [12] | 0.334 | - | 0.441 |
| 8 | Garofalo et al. [7] | 0.166 | 1.260 | 0.163 |
| 8 | Proposed | 0.100 | 0.106 | 0.105 |
| 10 | Walters et al. [9] | 0.166 | - | 0.226 |
| 10 | Garofalo et al. [12] | 0.333 | - | 0.529 |
| 10 | Garofalo et al. [7] | 0.166 | 1.263 | 0.181 |
| 10 | Proposed | 0.119 | 1.062 | 0.120 |
| 12 | Walters et al. [9] | 0.167 | - | 0.246 |
| 12 | Garofalo et al. [12] | 0.333 | - | 0.607 |
| 12 | Garofalo et al. [7] | 0.167 | 1.308 | 0.200 |
| 12 | Proposed | 0.118 | 1.103 | 0.119 |
| 14 | Walters et al. [9] | 0.167 | - | 0.266 |
| 14 | Garofalo et al. [12] | 0.333 | - | 0.693 |
| 14 | Garofalo et al. [7] | 0.167 | 1.365 | 0.220 |
| 14 | Proposed | 0.120 | 1.117 | 0.124 |
| 16 | Walters et al. [9] | 0.167 | 1.862 | 0.287 |
| 16 | Garofalo et al. [12] | 0.333 | - | 0.776 |
| 16 | Garofalo et al. [7] | 0.167 | 1.432 | 0.240 |
| 16 | Proposed | 0.122 | 1.120 | 0.125 |

Table II summarizes the error analysis results and the comparison between the proposed squarer and other architectures. The proposed fixed-width squarer results in lower error than the others. The mean error (ME) and mean square error (MSE) of the proposed squarer are reduced by up to 40% compared with those of the method of V. Garofalo et al. presented in [7], whereas the maximum error (MaxE) is reduced by up to 22%. This low error makes the proposed method highly applicable to future DSP applications, which require squarer circuits with not only high speed and low power but also low error.

V. IMPLEMENTATION RESULTS

The proposed fixed-width squarer and other architectures have been implemented in 0.18-µm CMOS technology with Synopsys design tools, using the same standard cell library and the same design parameters and constraints for the different values of W. For the designs with W ≥ 10, the coarse/fine LUT splitting method is also employed to further reduce the LUT size, as shown in Fig. 4. Table III shows the LUT parameters, size and compression ratio of the different fixed-width squarer designs using the proposed hybrid LUT-based architecture and the coarse/fine LUT splitting method, where A and B denote the input widths of the two split LUTs as depicted in Fig. 4. The compression ratio is calculated as the ratio between the direct LUT size and the proposed LUT size (in bits).

The implementation results in 0.18-µm CMOS technology are presented in Table IV and Fig. 5. The proposed fixed-width squarer reduces the area and delay significantly compared with the other methods. To compare the overall squarer performance across methods, the area-delay product (ADP) is used as a figure of merit. The numbers in parentheses in the fifth column of Table IV give the ADP normalized to that of [7] for each value of W. Compared with the fixed-width squarer method presented in [7], the proposed hybrid LUT-based architecture reduces the ADP by up to 42%. Fig. 6 is the chip microphotograph of the proposed 8-bit fixed-width squarer, which was fabricated in 0.18-µm CMOS technology. Fig. 7 presents the chip functional measurement using a logic analyzer (right side), which is the same as the simulation result in ModelSim (left side). Table V shows the measurement results for the average power consumption and the longest (critical) path delay (between the input $x_6$ and the output $s_7$), as depicted in Fig. 8.

Fig. 5. ADP results for different values of W.

TABLE III
LUT PARAMETERS AND COMPRESSION RATIO OF THE PROPOSED FIXED-WIDTH SQUARER DESIGNS WITH LUT SPLITTING TECHNIQUE.

| W | 8 | 10 | 12 | 14 | 16 |
|---|---|---|---|---|---|
| A | - | 5 | 7 | 8 | 9 |
| B | - | 3 | 3 | 4 | 5 |
| Proposed LUT (bits) | 320 | 704 | 3072 | 10752 | 38912 |
| Direct LUT (bits) | 2048 | 10240 | 49152 | 229376 | 1048576 |
| Compression ratio | 6.4:1 | 14.5:1 | 16:1 | 21.3:1 | 26.9:1 |
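The entries of Table III can be cross-checked with a few lines of arithmetic. The Python sketch below assumes that a direct LUT stores $2^W$ words of W bits each, an assumption that reproduces the "Direct LUT (bits)" row of the table. The proposed LUT sizes are copied from the table rather than derived, and the function names are hypothetical.

```python
# Proposed LUT sizes in bits, copied from Table III (not derived here).
PROPOSED_LUT_BITS = {8: 320, 10: 704, 12: 3072, 14: 10752, 16: 38912}


def direct_lut_bits(width: int) -> int:
    """Size of a direct squaring LUT, assuming 2^W entries of W bits each."""
    return (1 << width) * width


def compression_ratio(width: int) -> float:
    """Ratio between the direct LUT size and the proposed LUT size (in bits)."""
    return direct_lut_bits(width) / PROPOSED_LUT_BITS[width]


if __name__ == "__main__":
    for w in sorted(PROPOSED_LUT_BITS):
        print(w, direct_lut_bits(w), f"{compression_ratio(w):.1f}:1")
```

The growth of `direct_lut_bits` with W is the exponential LUT growth that motivates the hybrid architecture.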
TABLE IV
IMPLEMENTATION RESULTS OF DIFFERENT FIXED-WIDTH SQUARER ARCHITECTURES USING 0.18-µm CMOS TECHNOLOGY.

| W | Method | Area (10³ µm²) | Delay (ns) | ADP (×10³) |
|---|---|---|---|---|
| 8 | Folded | 2.3 | 6.6 | 15.3 (1.34) |
| 8 | Direct LUT-based | 4.0 | 4.8 | 19.2 (1.68) |
| 8 | Garofalo et al. [7] | 1.9 | 5.9 | 11.4 (1.00) |
| 8 | Proposed | 1.5 | 4.4 | 6.6 (0.58) |
| 10 | Folded | 3.8 | 7.4 | 28.3 (1.33) |
| 10 | Direct LUT-based | 19.6 | 6.2 | 121.7 (5.73) |
| 10 | Garofalo et al. [7] | 3.3 | 6.5 | 21.2 (1.00) |
| 10 | Proposed | 2.7 | 5.6 | 15.1 (0.71) |
| 12 | Folded | 6.1 | 8.6 | 52.9 (1.20) |
| 12 | Direct LUT-based | 48.8 | 6.9 | 336.7 (7.66) |
| 12 | Garofalo et al. [7] | 5.4 | 8.2 | 43.9 (1.00) |
| 12 | Proposed | 5.0 | 6.7 | 33.5 (0.76) |
| 14 | Folded | 8.7 | 9.9 | 86.6 (1.23) |
| 14 | Direct LUT-based | 98.9 | 7.9 | 762.4 (11.1) |
| 14 | Garofalo et al. [7] | 7.7 | 9.1 | 70.5 (1.00) |
| 14 | Proposed | 7.4 | 7.6 | 56.2 (0.80) |
| 16 | Folded | 10.5 | 11.0 | 115.7 (1.21) |
| 16 | Direct LUT-based | 208.7 | 9.4 | 2024.4 (21.2) |
| 16 | Garofalo et al. [7] | 9.2 | 10.4 | 95.4 (1.00) |
| 16 | Proposed | 8.9 | 9.0 | 80.3 (0.84) |

Fig. 6. Chip microphotograph of the proposed 8-bit fixed-width squarer (dimension labels in the figure: 40 µm and 40 µm).

TABLE V
CHIP PARAMETERS OF THE PROPOSED 8-BIT FIXED-WIDTH SQUARER.

| Parameter | Value |
|---|---|
| Technology | 0.18-µm |
| Supply voltage (VDD) | 1.8 V |
| Average power consumption | 0.362 mW @ 50 MHz |
| Longest path delay | 4.7 ns |

[Fig. 7 waveform values (hex): input X (bits x7 to x0) = 7E, 7F, 80, 81, 82; output S (bits s7 to s0) = 3E, 3F, 40, 41, 42.]
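The X and S values shown for Fig. 7 can be compared against an ideal fixed-width result, taken here to be the upper W bits of the full 2W-bit square. The Python sketch below is only a consistency check on the five displayed samples, under that assumption; it does not model the proposed squarer's LUT or compensation logic, and the name `truncated_square` is hypothetical.

```python
def truncated_square(x: int, width: int) -> int:
    """Upper W bits of the 2W-bit square, i.e. floor(x^2 / 2^W)."""
    return (x * x) >> width


# Input and output samples read from the Fig. 7 waveform (hexadecimal).
FIG7_INPUTS = [0x7E, 0x7F, 0x80, 0x81, 0x82]
FIG7_OUTPUTS = [0x3E, 0x3F, 0x40, 0x41, 0x42]

if __name__ == "__main__":
    for x, s in zip(FIG7_INPUTS, FIG7_OUTPUTS):
        print(f"X={x:02X} S={s:02X} ideal={truncated_square(x, 8):02X}")
```

For these five inputs the displayed outputs coincide with plain truncation; other inputs may differ by the small errors reported in Table II.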
Fig. 7. Simulation result in ModelSim (left) and measurement result in logic analyzer (right).

[Figure: delay measurement from input x6 to output s7, 4.7 ns.]

VI. CONCLUSION

In this paper, we have presented an efficient approach for the design of fixed-width squarer circuits with low error and high area-delay efficiency. The improved hybrid architecture employs both LUT-based and simple conventional logic circuits to achieve a good trade-off between circuit performance and complexity. The implementation and chip measurement results in 0.18-µm CMOS technology have been presented and discussed. Compared with other methods presented in the literature for fixed-width and truncated squaring circuits, the proposed method not only reduces the error significantly but also improves the squarer performance and area efficiency. Therefore, it has great potential to be applied in modern and future DSP and multimedia systems.

ACKNOWLEDGMENT

The chip presented in this paper was fabricated in the chip fabrication program of the VLSI Design and Education Center (VDEC), The University of Tokyo, in collaboration with ROHM CO. LTD.

REFERENCES

[1] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-power CMOS digital design,” IEEE J. Solid-State Circuits, vol. 27, no. 4, pp. 473-484, Apr. 1992.
[2] J. Pihl and E. Aas, “A multiplier and squarer generator for high performance DSP applications,” Proc. 39th Midwest IEEE Symposium on Circuit and Systems, pp. 109-112, Aug. 1996.
[3] R. K. Kolagotla, N. R. Griesbach, and H. R. Srinivas, “VLSI implementation of 350 MHz 0.35 um 8 bit merged squarer,” Electronics Letters, vol. 34, no. 1, pp. 47-48, Jan. 1998.
[4] K.-J. Cho and J.-G. Chung, “Parallel squarer design using pre-calculated sums of partial products,” Electronics Letters, vol. 43, no. 25, pp. 1414-1416, Dec. 2007.
[5] A. G. M. Strollo and D. De Caro, “Booth folding encoding for high performance squarer circuits,” IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, vol. 50, no. 5, pp. 250-254, May 2003.
[6] K.-J. Cho and J.-G. Chung, “Adaptive error compensation for low error fixed-width squarers,” IEICE Trans. Information and Systems, vol. E90-D, no. 3, pp. 621-626, Mar. 2007.
[7] V. Garofalo, M. Coppola, D. De Caro, E. Napoli, N. Petra, and A. G. M. Strollo, “A novel truncated squarer with linear compensation function,” Proc. 2010 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 4157-4160, Jun. 2010.
[8] P. K. Meher, “LUT optimization for memory-based computation,” IEEE Trans. Circuits and Systems II: Express Briefs, vol. 57, no. 4, pp. 285-289, Apr. 2010.
[9] E. G. Walters III and M. J. Schulte, “Efficient function approximation using truncated multipliers and squarers,” Proc. 17th IEEE Symposium on Computer Arithmetic, pp. 232-239, Jun. 2005.