
Low-Error and Efficient Fixed-Width Squarer for Digital Signal Processing Applications


Van-Phuc Hoang and Cong-Kha Pham
VLSI Laboratory, Department of Electronic Engineering
The University of Electro-Communications
1-5-1 Chofugaoka, Chofu-shi, Tokyo, 182-8585, Japan
Email: hoang (at) vlsilab.ee.uec.ac.jp

Abstract: This paper presents a new approach that uses an improved hybrid LUT-based architecture for low-error and efficient fixed-width squarer circuits. By employing both LUT-based and simple conventional logic circuits, a good trade-off between hardware complexity and performance can be achieved. Moreover, a mathematical identity of the squaring operation is exploited so that the error can be reduced significantly compared with other methods. The proposed method can also improve the speed and reduce the area of the squarer circuit. The implementation and chip measurement results in 0.18-µm CMOS technology are also presented and discussed.

I. INTRODUCTION

High-performance multiplier and squarer circuits are extensively used and play an essential role in modern and future digital signal processing (DSP) systems due to the increasing demand for high-speed applications. In particular, squaring is a fundamental arithmetic operation in many applications such as Viterbi decoding, digital filtering, vector quantization, pattern recognition and image compression. However, direct implementation of general parallel squarer circuits often requires high hardware complexity and may lead to high power consumption. Moreover, the rapid growth of mobile and portable devices leads to the emerging requirement of low-power and low-complexity digital circuits [1]. Therefore, much recent research has tried to find efficient architectures for low-area, low-power and high-performance squarer circuits. Some proposed methods such as the folded technique [2], the merged technique [3] and the method of K.-J. Cho et al. [4] employ mathematical identities to reduce the hardware complexity and improve the performance of the full-width squaring circuit. A. G. M. Strollo et al. [5] proposed an improved method that combines Booth recoding with the folded technique and some specific sub-circuits.

On the other hand, in many DSP applications the squaring result is not required to be full-width but to have the same bit-width as the operand. The architecture of a high-performance, low-complexity fixed-width squarer therefore becomes an important topic, with many issues to be considered such as error compensation, hardware efficiency and squarer performance. The authors of [6] and [7] have proposed methods to improve the accuracy and performance of truncated and fixed-width squarer circuits, but more improvements are desired, especially for future DSP systems which require high-performance, low-power circuits to meet the high demand for portability and mobility.

Moreover, lookup table (LUT)-based computation has great potential to be employed in future DSP and communication systems because it mainly relies on memory access operations, resulting in high-speed computation and low power consumption [8]. The disadvantage that limits the popularity of LUT-based architectures in current systems is the exponential growth of the LUT size as the operand width increases. Therefore, it is reasonable to find a proper hybrid architecture that takes advantage of both LUT-based and conventional logic circuits so that a good trade-off between squarer performance and hardware complexity can be achieved. In this paper, we present a new approach for designing the fixed-width squarer circuit for specific DSP applications by employing an improved hybrid LUT-based architecture to reduce the error, delay and hardware complexity. The main contribution of our paper is an improved architecture for a low-error and area-delay-efficient fixed-width squarer, obtained by employing a specific mathematical identity of the squaring operation in a new approach with some improvements.

The rest of this paper is organized as follows. Section II briefly introduces the fixed-width squarer and Section III presents the proposed hybrid LUT-based architecture for the fixed-width squarer. Section IV shows the error analysis results and Section V summarizes the implementation and chip measurement results of the proposed squarer architecture using 0.18-µm CMOS technology. Finally, the conclusion is given in Section VI.

II. FIXED-WIDTH SQUARER

In this paper, we consider the square of a $W$-bit binary number $X$. The full-width ($2W$-bit) square $X^2$ can be written as:

$$S = X^2 = \left( \sum_{i=0}^{W-1} x_i 2^i \right)^2 = \sum_{i=0}^{W-1} \sum_{j=0}^{W-1} x_i x_j 2^{i+j} \tag{1}$$

where $x_i$ and $x_j$ denote the binary bits of $X$. Fig. 1 shows the original partial product matrix (PPM) for a general full-width 8-bit squarer, in which each partial product bit $p_{ij}$ is computed by an AND logic operation as $p_{ij} = x_i x_j$.

In a full-width squarer, all partial product bits in the PPM are summed by using a compression tree and an adder [2]. Some

978-1-4673-2493-9/12/$31.00 ©2012 IEEE 477


methods proposed in [2]-[4] exploit mathematical identities of the squaring operation to reduce the complexity of the PPM summation.

[Figure: 8 × 8 partial product matrix with rows $p_{7j} \ldots p_{0j}$ for $j = 0, \ldots, 7$, columns indexed (15) down to (0), partitioned into the regions labeled MSP, IC and LSP.]
Fig. 1. Original partial product matrix of the 8-bit fixed-width squarer.

However, in a fixed-width squarer, the square result is required to have the same bit-width as the operand. For example, in the 8-bit fixed-width squarer shown in Fig. 1, the 8-bit square result corresponds to the 8 most significant bits of the full-width square result. The most accurate fixed-width squaring method is ideal rounding, in which all partial products in the PPM are summed and the full-width ($2W$-bit) result is rounded to provide the fixed-width ($W$-bit) result. For mathematical convenience, as shown in Fig. 1, the PPM is divided into three parts: MSP (Most Significant Part), IC (Input Correction) and LSP (Least Significant Part). The ideal rounding operation can be improved by adding '1' to the IC column as the correction constant before summing and then simply dropping the least significant bits of the full-width result to get the fixed-width square result. Both of these methods lead to high complexity and long delay because all partial products of IC and LSP are generated and accumulated. Therefore, to reduce the hardware complexity and delay, more advanced fixed-width squarer methods discard the LSP and use the IC to compensate for the error caused by discarding the LSP.

E. G. Walters III et al. [9] proposed an architecture for a truncated squarer with a variable correction scheme. Kyung-Ju Cho et al. [6] presented an adaptive error compensation method for the fixed-width squarer with more mathematical analysis and modifications of the Booth folding encoding method. V. Garofalo et al. [7] proposed an improved method that applies the analytical technique presented in [10] and [11] to find a sub-optimal linear compensation function that reduces the error and the number of partial products. The approximated linear function $f(IC)$ is based on the IC column, so that the approximated square $X_t^2$ is computed as:

$$X_t^2 \simeq S_{MSP} + f(IC) + K_R \tag{2}$$

where $K_R$ and $S_{MSP}$ denote the correction constant and the weighted sum of the MSP, respectively. This method reduces the mean error by 5% to 20% compared with previous methods and also improves the performance of the squarer circuit.

III. PROPOSED HYBRID LUT-BASED FIXED-WIDTH SQUARER

An LUT-based circuit provides its outputs by accessing a pre-stored lookup table (LUT) rather than by performing actual computations. This lookup-only operation leads to high-speed computation because it mainly requires only the time for accessing the pre-stored tables. Moreover, LUT-based circuits consume less dynamic power because of less bit-switching [12]. As shown in Fig. 2, an LUT-based full-width squarer requires an LUT of $2^W \times 2W$ bits to store $2^W$ words with a word length of $2W$ bits. For a fixed-width squarer using this direct LUT-based architecture, an LUT of $2^W \times W$ bits is required to store the ideally rounded values of the squaring computation. The maximum error of this ideally rounded direct LUT-based squarer is limited to only 0.5 LSB (Least Significant Bit) of the squaring result. However, this direct LUT-based implementation leads to exponential growth of the LUT size when the operand width increases. Therefore, some methods have been proposed to reduce the LUT size for LUT-based computation design.

[Figure: the $W$-bit input $X$ addresses a lookup table that outputs the $2W$-bit value $X^2$.]
Fig. 2. General LUT-based full-width squarer.

Pramod Kumar Meher [8] has proposed LUT optimization methods that can reduce the LUT size significantly for an LUT-based constant multiplier. However, these methods cannot be applied directly to the fixed-width squarer because of the different PPMs and the truncation operation required in fixed-width squaring computation. Likewise, some improved methods for fixed-width and truncated multipliers cannot be employed directly for the design of a fixed-width squarer [11]. As a result, a more accurate and efficient architecture for the fixed-width squarer is highly desired. It is reasonable to seek a good trade-off between LUT size and squarer performance by employing a hybrid LUT-based structure that properly uses both LUT and conventional logic circuits.

Again, consider the square of the $W$-bit binary number as presented in (1). In this section, we apply the mathematical characteristics presented in [13] and [14] with some modifications and improvements as described below. The novelty of our proposed approach is that a specific mathematical identity of the squaring operation presented in [13] and [14] is employed in a new method to derive an efficient architecture for the fixed-width squarer circuit with lower error and higher area-delay efficiency. Let $X = (x_{W-1} x_{W-2} \ldots x_1 x_0)$ and $Y = (y_{W-1} y_{W-2} \ldots y_1 y_0)$ be two $W$-bit binary numbers satisfying $y_i = \bar{x}_i$, $0 \le i \le W-1$. Then the difference $D$

between the squares of the two numbers can be expressed as:

$$D = |X|^2 - |Y|^2 = -y_{W-1}\, 2^W (2^W + 1) + |x_{W-2} x_{W-3} \ldots x_1 x_0\, 0\, y_{W-2} y_{W-3} \ldots y_1 y_0\, 1| \tag{3}$$

where $|X|$ denotes the absolute value of the binary number $X$. Therefore, when $y_{W-1} = 0$, i.e. $x_{W-1} = 1$, it is derived that:

$$D = |x_{W-2} x_{W-3} \ldots x_1 x_0\, 0\, y_{W-2} y_{W-3} \ldots y_1 y_0\, 1| \tag{4}$$

Exploiting this mathematical property, we can derive the relationship between the square of the number $X$ and its binary bits, as shown in Table I for the case of four-folding, together with the following equation:

$$|X|^2 = |Y|^2 + D \tag{5}$$

The four-folding method means that the two most significant bits of the operand are used as selecting bits to form the squaring result. Exploiting (5), to get the fixed-width result, the rounding operation of the square can be approximated by summing the rounding results of its components as follows:

$$\mathrm{round}_W(|X|^2) \simeq \mathrm{round}_W(|Y|^2) + \mathrm{round}_W(D) \tag{6}$$

in which $\mathrm{round}_W(\cdot)$ denotes the operation of rounding to a $W$-bit result. This approximation is optimal if one of the two rounding operations on the right-hand side of (6) is error-free. In this section, we present a sub-optimal approximation method using the hybrid LUT architecture to reduce the error as much as possible.

As shown in Table I, the component $D$ is divided into two parts, $DH$ (high part) and $DL$ (low part), each with the same bit length of $W$. Therefore, it is obvious that:

$$D = |DH|\, 2^W + |DL| \tag{7}$$

As one can see in the fourth column of Table I, the most significant bit of $DL$ is always zero. Hence, the rounding operation $\mathrm{round}_W(D)$ can be approximated by discarding the $DL$ part, so that its result is given by the $DH$ part. The maximum error can be easily estimated as 1 LSB of the result compared with the ideal rounded result. Assuming that the operation $\mathrm{round}_W(|Y|^2)$ is performed by ideal rounding to get the $W$-bit result, the general maximum error of the approximation in (6) will be 1 LSB compared with the ideal rounded result. Therefore, it is promising that this method can result in a low-error squarer circuit. The results of the error analysis will be shown in the next section.

Fig. 3. Proposed hybrid LUT-based architecture for fixed-width squarer.

The block diagram of the proposed fixed-width squarer using the improved hybrid LUT-based architecture is shown in Fig. 3. The AMU (Address Mapping Unit) inverts (one's complement) the input operand and generates the $(W-2)$-bit address for the LUT. The LUT block provides the square values of its address input. The 4-input multiplexer (MUX4) generates the $DH$ part as shown in Table I by using the two most significant bits ($MSB2$) as the selecting bits. Then an adder provides the final fixed-width ($W$-bit) square result.

In general, the LUT size is reduced to $2^{W-2} \times (W-3)$ bits because it stores the rounding of the $2(W-2)$-bit result of $Y^2$. However, for the designs with $W \ge 10$, the coarse/fine LUT splitting method [15] is also employed to further reduce the LUT size, as shown in Fig. 4. The input $Y$ with the bit-width of $(A + B)$ is decomposed into two components: one with lower bit-width ($A$ bits) for addressing the coarse LUT and another with the same bit-width of $(A + B)$ bits for addressing the fine LUT. The coarse LUT has the same width but fewer storage entries than the traditional one, while the fine LUT has a smaller width and the same number of storage entries as the traditional one, and stores the difference between the rounded $Y^2$ and the coarse LUT entries. An additional adder is needed to provide the result of $Y^2$.

[Figure: block diagram with the input $Y$ of $(A + B)$ bits, an $A$-bit path to the coarse LUT, the fine LUT, and an adder producing $Y^2$ (output labeled $W-3$).]
Fig. 4. Coarse/fine LUT splitting to further reduce the LUT size.

IV. ERROR ANALYSIS OF THE PROPOSED FIXED-WIDTH SQUARER

In this paper, the mean error ($ME$), maximum error ($MaxE$) and mean square error ($MSE$) are used as the criteria for error analysis. The mean error and mean square error are related to the power of the error and hence play a very important role in DSP applications. However, the maximum error becomes the most important concern in safety-critical applications [16]. The error caused by the approximation in (6), together with $ME$ and $MSE$, can be expressed as:

$$Er = X^2 - \left( \mathrm{round}_W(|Y|^2) + \mathrm{round}_W(D) \right) \tag{8}$$

$$ME = E\{Er\} \tag{9}$$

$$MSE = E\{Er^2\} \tag{10}$$

where $E\{\cdot\}$ denotes the averaging operation.
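As a rough numerical check of (8)-(10), the sketch below evaluates the error of the approximation $\mathrm{round}_W(|Y|^2) + DH$ exhaustively for small $W$. It is an illustrative model rather than the authors' implementation. The function names are hypothetical; $Y$ is assumed to be the low $W-2$ bits of $X$, complemented when the second most significant bit is set (inferred from the AMU description); $DH$ is taken as the upper $W$ bits of $D = X^2 - Y^2$; errors are expressed in LSBs of the $W$-bit result against the exact value $X^2 / 2^W$; and no coarse/fine LUT splitting is modeled.

```python
def fold_operand(x: int, w: int) -> int:
    """Return the (w-2)-bit LUT operand Y for a w-bit input x (assumed mapping)."""
    n = w - 2
    mask = (1 << n) - 1
    low = x & mask
    # Assumption: Y is the one's complement of the low bits when bit w-2 is set.
    return (~low & mask) if (x >> n) & 1 else low


def approx_square(x: int, w: int) -> int:
    """Fixed-width square estimate round_W(Y^2) + DH, in units of 2**w."""
    y = fold_operand(x, w)
    d = x * x - y * y                      # D as defined by eq. (5)
    dh = d >> w                            # keep DH, discard DL (eq. (7))
    lut = (y * y + (1 << (w - 1))) >> w    # ideally rounded Y^2 (the LUT content)
    return lut + dh


def error_stats(w: int) -> tuple[float, float, float]:
    """Exhaustive ME, MaxE and MSE (LSB, LSB, LSB^2) over all w-bit inputs."""
    errors = [x * x / 2**w - approx_square(x, w) for x in range(1 << w)]
    me = sum(errors) / len(errors)
    max_e = max(abs(e) for e in errors)
    mse = sum(e * e for e in errors) / len(errors)
    return me, max_e, mse


if __name__ == "__main__":
    for width in (8, 10):
        me, max_e, mse = error_stats(width)
        print(f"W={width}: ME={me:.3f} LSB, MaxE={max_e:.3f} LSB, MSE={mse:.3f} LSB^2")
```

Under these assumptions the mean error for $W = 8$ should come out close to 0.1 LSB. The maximum and mean square errors depend on the error reference used (exact value or ideally rounded value), so they are not expected to match published figures digit for digit.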

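TABLE I
FOUR-FOLDING METHOD FOR HYBRID LUT-BASED FIXED-WIDTH SQUARER.

| $MSB2$ | $Y$ | $DH$ | $DL$ |
|---|---|---|---|
| 00 | $x_{W-3} \ldots x_1 x_0$ | $0\,0\,0 \ldots 0\,0$ ($DH0$) | $0\,0 \ldots 0\,0$ |
| 01 | $\bar{x}_{W-3} \ldots \bar{x}_1 \bar{x}_0$ | $0\,0\,x_{W-3} \ldots x_1 x_0$ ($DH1$) | $0\,\bar{x}_{W-3} \ldots \bar{x}_0\,1$ |
| 10 | $x_{W-3} \ldots x_1 x_0$ | $0\,1\,x_{W-3} \ldots x_1 x_0$ ($DH2$) | $0\,0 \ldots 0\,0$ |
| 11 | $\bar{x}_{W-3} \ldots \bar{x}_1 \bar{x}_0$ | $1\,x_{W-3} \ldots x_1 x_0\,0$ ($DH3$) | $0\,\bar{x}_{W-3} \ldots \bar{x}_0\,1$ |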
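The bit patterns of Table I can be checked against identity (5) and the split (7) with a short script. This is a minimal sketch under stated assumptions: the function names are hypothetical, and $Y$ and the inner bits of $DL$ are taken as the one's complement of $x_{W-3} \ldots x_0$ in the rows with $MSB2$ = 01 and 11, which is what (5) requires for the listed $DH$ patterns.

```python
def table_one_row(x: int, w: int) -> tuple[int, int, int]:
    """Return (Y, DH, DL) for a w-bit input x, following the four-folding rows."""
    n = w - 2
    mask = (1 << n) - 1
    low = x & mask      # x_{W-3} ... x_0
    inv = ~low & mask   # bitwise complement of the low bits (assumption)
    msb2 = x >> n       # two most significant bits select the row
    if msb2 == 0b00:
        return low, 0, 0
    if msb2 == 0b01:
        # DH1 = 0 0 x...x ; DL = 0 ~x...~x 1
        return inv, low, (inv << 1) | 1
    if msb2 == 0b10:
        # DH2 = 0 1 x...x ; DL = 0
        return low, (1 << n) | low, 0
    # DH3 = 1 x...x 0 ; DL = 0 ~x...~x 1
    return inv, (1 << (w - 1)) | (low << 1), (inv << 1) | 1


def check_identity(w: int) -> bool:
    """True if X^2 = Y^2 + DH * 2^W + DL and MSB(DL) = 0 for every w-bit X."""
    for x in range(1 << w):
        y, dh, dl = table_one_row(x, w)
        if x * x != y * y + (dh << w) + dl:
            return False
        if dl >> (w - 1):
            return False
    return True


if __name__ == "__main__":
    print(all(check_identity(w) for w in (4, 8, 10)))
```

Because the most significant bit of $DL$ is zero in every row, $DL < 2^{W-1}$, so rounding $D$ to $W$ bits returns $DH$ exactly; the error of (6) then comes from the rounding of $Y^2$ plus the discarded $DL$.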

Table II summarizes the error analysis results and the comparison between the proposed squarer and other architectures. It shows that the proposed fixed-width squarer results in lower error than the others. The mean error and mean square error of the proposed squarer are reduced by up to 40% compared with those of the method of V. Garofalo et al. presented in [7], whereas the maximum error is reduced by up to 22%. This low error makes the proposed method highly applicable to future DSP applications which require not only high speed and low power but also low-error squarer circuits.

TABLE II
ERROR ANALYSIS RESULTS FOR DIFFERENT FIXED-WIDTH SQUARER ARCHITECTURES.

| W | Method | ME (LSB) | MaxE (LSB) | MSE (LSB²) |
|---|---|---|---|---|
| 8 | Walters et al. [9] | 0.166 | 1.215 | 0.207 |
| 8 | Garofalo et al. [12] | 0.334 | - | 0.441 |
| 8 | Garofalo et al. [7] | 0.166 | 1.260 | 0.163 |
| 8 | Proposed | 0.100 | 0.106 | 0.105 |
| 10 | Walters et al. [9] | 0.166 | - | 0.226 |
| 10 | Garofalo et al. [12] | 0.333 | - | 0.529 |
| 10 | Garofalo et al. [7] | 0.166 | 1.263 | 0.181 |
| 10 | Proposed | 0.119 | 1.062 | 0.120 |
| 12 | Walters et al. [9] | 0.167 | - | 0.246 |
| 12 | Garofalo et al. [12] | 0.333 | - | 0.607 |
| 12 | Garofalo et al. [7] | 0.167 | 1.308 | 0.200 |
| 12 | Proposed | 0.118 | 1.103 | 0.119 |
| 14 | Walters et al. [9] | 0.167 | - | 0.266 |
| 14 | Garofalo et al. [12] | 0.333 | - | 0.693 |
| 14 | Garofalo et al. [7] | 0.167 | 1.365 | 0.220 |
| 14 | Proposed | 0.120 | 1.117 | 0.124 |
| 16 | Walters et al. [9] | 0.167 | 1.862 | 0.287 |
| 16 | Garofalo et al. [12] | 0.333 | - | 0.776 |
| 16 | Garofalo et al. [7] | 0.167 | 1.432 | 0.240 |
| 16 | Proposed | 0.122 | 1.120 | 0.125 |

V. IMPLEMENTATION RESULTS

The proposed fixed-width squarer and other architectures have been implemented in 0.18-µm CMOS technology with Synopsys design tools, using the same standard cell library with the same design parameters and constraints for different values of $W$. For the designs with $W \ge 10$, the coarse/fine LUT splitting method is also employed to further reduce the LUT size, as shown in Fig. 4. Table III shows the LUT parameters, size and compression ratio of different fixed-width squarer designs using the proposed hybrid LUT-based architecture and the coarse/fine LUT splitting method, in which $A$ and $B$ denote the input widths of the two split LUTs as depicted in Fig. 4. The compression ratio is calculated as the ratio between the direct LUT size and the proposed LUT size (in bits).

The implementation results in 0.18-µm CMOS technology are presented in Table IV and Fig. 5. They show that the proposed fixed-width squarer reduces the area and delay significantly when compared with other methods. To compare the overall squarer performance of the different methods, the area-delay product (ADP) is used as a figure of merit. The numbers in parentheses in the fifth column of Table IV present the normalized ADP results of the different fixed-width squarer designs for each value of $W$. Compared with the fixed-width squarer method presented in [7], the proposed hybrid LUT-based architecture reduces the ADP by up to 42%. Fig. 6 is the chip microphotograph of the proposed 8-bit fixed-width squarer, which is fabricated in 0.18-µm CMOS technology. Fig. 7 presents the chip functional measurement using a logic analyzer (right side), which is the same as the simulation result in Modelsim software (left side). Table V shows the measurement results for the average power consumption and the longest (critical) path delay (between the input $x_6$ and the output $s_7$), as depicted in Fig. 8.

Fig. 5. ADP results for different values of W.

TABLE III
LUT PARAMETERS AND COMPRESSION RATIO OF THE PROPOSED FIXED-WIDTH SQUARER DESIGNS WITH LUT SPLITTING TECHNIQUE.

| W | 8 | 10 | 12 | 14 | 16 |
|---|---|---|---|---|---|
| A | - | 5 | 7 | 8 | 9 |
| B | - | 3 | 3 | 4 | 5 |
| Proposed LUT (bits) | 320 | 704 | 3072 | 10752 | 38912 |
| Direct LUT (bits) | 2048 | 10240 | 49152 | 229376 | 1048576 |
| Compression ratio | 6.4:1 | 14.5:1 | 16:1 | 21.3:1 | 26.9:1 |

[Figure: chip microphotograph with 40 µm dimension markers.]
Fig. 6. Chip microphotograph of the proposed 8-bit fixed-width squarer.

TABLE IV
IMPLEMENTATION RESULTS OF DIFFERENT FIXED-WIDTH SQUARER ARCHITECTURES USING 0.18-µm CMOS TECHNOLOGY.

| W | Method | Area (10³ µm²) | Delay (ns) | ADP (×10³) (normalized) |
|---|---|---|---|---|
| 8 | Folded | 2.3 | 6.6 | 15.3 (1.34) |
| 8 | Direct LUT-based | 4.0 | 4.8 | 19.2 (1.68) |
| 8 | Garofalo et al. [7] | 1.9 | 5.9 | 11.4 (1.00) |
| 8 | Proposed | 1.5 | 4.4 | 6.6 (0.58) |
| 10 | Folded | 3.8 | 7.4 | 28.3 (1.33) |
| 10 | Direct LUT-based | 19.6 | 6.2 | 121.7 (5.73) |
| 10 | Garofalo et al. [7] | 3.3 | 6.5 | 21.2 (1.00) |
| 10 | Proposed | 2.7 | 5.6 | 15.1 (0.71) |
| 12 | Folded | 6.1 | 8.6 | 52.9 (1.20) |
| 12 | Direct LUT-based | 48.8 | 6.9 | 336.7 (7.66) |
| 12 | Garofalo et al. [7] | 5.4 | 8.2 | 43.9 (1.00) |
| 12 | Proposed | 5.0 | 6.7 | 33.5 (0.76) |
| 14 | Folded | 8.7 | 9.9 | 86.6 (1.23) |
| 14 | Direct LUT-based | 98.9 | 7.9 | 762.4 (11.1) |
| 14 | Garofalo et al. [7] | 7.7 | 9.1 | 70.5 (1.00) |
| 14 | Proposed | 7.4 | 7.6 | 56.2 (0.80) |
| 16 | Folded | 10.5 | 11.0 | 115.7 (1.21) |
| 16 | Direct LUT-based | 208.7 | 9.4 | 2024.4 (21.2) |
| 16 | Garofalo et al. [7] | 9.2 | 10.4 | 95.4 (1.00) |
| 16 | Proposed | 8.9 | 9.0 | 80.3 (0.84) |
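The normalized ADP of the proposed design, and the ADP reduction relative to [7], can be recomputed from the ADP column of Table IV. The sketch below is illustrative only: the names are hypothetical and the ADP values are copied from the table. Because the table entries are already rounded, recomputing other rows from them may differ from the printed normalized values in the last digit.

```python
# ADP values (x10^3) copied from Table IV: W -> (Garofalo et al. [7], proposed).
ADP = {
    8: (11.4, 6.6),
    10: (21.2, 15.1),
    12: (43.9, 33.5),
    14: (70.5, 56.2),
    16: (95.4, 80.3),
}


def normalized_adp(w: int) -> float:
    """ADP of the proposed design divided by the ADP of the reference [7]."""
    reference, proposed = ADP[w]
    return proposed / reference


def max_adp_reduction_percent() -> float:
    """Largest ADP reduction of the proposed design over all listed widths."""
    return max((1.0 - normalized_adp(w)) * 100.0 for w in ADP)


if __name__ == "__main__":
    for width in ADP:
        print(f"W={width}: normalized ADP = {normalized_adp(width):.2f}")
    print(f"maximum reduction = {max_adp_reduction_percent():.0f}%")
```

The largest reduction occurs at $W = 8$, and the advantage narrows as $W$ grows.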
[Figure: waveforms of the inputs x7 to x0 and the outputs s7 to s0; for the inputs X = 7E, 7F, 80, 81, 82 the outputs are S = 3E, 3F, 40, 41, 42.]
Fig. 7. Simulation result in Modelsim software (left) and measurement result in logic analyzer (right).

TABLE V
CHIP PARAMETERS OF THE PROPOSED 8-BIT FIXED-WIDTH SQUARER.

| Parameter | Value |
|---|---|
| Technology | 0.18-µm |
| Supply voltage (VDD) | 1.8 V |
| Average power consumption | 0.362 mW @ 50 MHz |
| Longest path delay | 4.7 ns |

VI. CONCLUSION

In this paper, we have presented an efficient approach for the design of fixed-width squarer circuits with low error and high area-delay efficiency. The improved hybrid architecture employs both LUT-based and simple conventional logic circuits to achieve a good trade-off between circuit performance and complexity. The implementation and chip measurement results in 0.18-µm CMOS technology have been shown and discussed. Compared with other methods presented in the literature for fixed-width and truncated squaring circuits, the proposed method not only reduces the error significantly, but also improves the squarer performance and area efficiency. Therefore, it has great potential to be applied in modern and future DSP and multimedia systems.

ACKNOWLEDGMENT

The chip presented in this paper was fabricated in the chip fabrication program of VLSI Design and Education Center (VDEC), The University of Tokyo, in collaboration with ROHM CO. LTD.

REFERENCES

[1] Anantha P. Chandrakasan, Samuel Sheng, and Robert W. Brodersen, “Low-power CMOS digital design,” IEEE J. Solid-State Circuits, vol. 27, no. 4, pp. 473-484, Apr. 1992.
[2] J. Pihl and E. Aas, “A multiplier and squarer generator for high performance DSP applications,” Proc. 39th Midwest IEEE Symposium on Circuit and Systems, pp. 109-112, Aug. 1996.
[3] R. K. Kolagotla, N. R. Griesbach and H. R. Srinivas, “VLSI implementation of 350 MHz 0.35 um 8 bit merged squarer,” Electronics Letters, vol. 34, no. 1, pp. 47-48, Jan. 1998.
[4] K.-J. Cho and J.-G. Chung, “Parallel squarer design using pre-calculated sums of partial products,” Electronics Letters, vol. 43, no. 25, pp. 1414-1416, Dec. 2007.
[5] A. G. M. Strollo and D. De Caro, “Booth folding encoding for high performance squarer circuits,” IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, vol. 50, no. 5, pp. 250-254, May 2003.
[6] Kyung-Ju Cho and Jin-Gyun Chung, “Adaptive error compensation for low error fixed-width squarers,” IEICE Trans. Information and Systems, vol. E90-D, no. 3, pp. 621-626, Mar. 2007.
[7] Valeria Garofalo, Marino Coppola, Davide De Caro, Ettore Napoli, Nicola Petra, and Antonio G. M. Strollo, “A novel truncated squarer with linear compensation function,” Proc. 2010 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 4157-4160, Jun. 2010.
[8] Pramod Kumar Meher, “LUT optimization for memory-based computation,” IEEE Trans. Circuits and Systems-II: Express Briefs, vol. 57, no. 4, pp. 285-289, Apr. 2010.
[9] E. G. Walters III and M. J. Schulte, “Efficient function approximation using truncated multipliers and squarers,” Proc. 17th IEEE Symposium on Computer Arithmetic, pp. 232-239, Jun. 2005.

[Figure: waveforms of the input x6 and the output s7; the marked delay is 4.7 ns.]
Fig. 8. Longest path delay measurement result.

[10] N. Petra, D. De Caro, V. Garofalo, E. Napoli, and A. G. M. Strollo, “Truncated binary multipliers with variable correction and minimum mean square error,” IEEE Trans. Circuits and Systems-I: Regular Papers, vol. 57, no. 6, pp. 1312-1325, Jun. 2010.
[11] V. Garofalo, N. Petra, D. De Caro, A. G. M. Strollo and E. Napoli, “Low error truncated multipliers for DSP applications,” Proc. 15th IEEE International Conference on Electronics, Circuits and Systems (ICECS 2008), pp. 29-32, Sep. 2008.
[12] Pramod Kumar Meher, “LUT-based circuits for future wireless systems,” Proc. 53rd IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 696-699, Aug. 2010.
[13] Chin-Long Wey and Ming-Der Shieh, “Design of a high speed square generator,” IEEE Trans. Computers, vol. 47, no. 9, pp. 1021-1026, Sep. 1998.
[14] Wei-Chang Tsai, Ming-Der Shieh, Wen-Chin Lin, and Chin-Long Wey, “Design of square generator with small look-up table,” Proc. IEEE Asia Pacific Conference on Circuits and Systems (APCCAS 2008), pp. 172-175, Dec. 2008.
[15] Jouko Vankka, Digital synthesizers and transmitters for software radio, chapter 9, Springer, 2005.
[16] V. Garofalo, N. Petra and E. Napoli, “Analytical calculation of the maximum error for a family of truncated multipliers providing minimum mean square error,” IEEE Trans. Computers, vol. 60, no. 9, pp. 1366-1371, Sep. 2011.
