Accepted Manuscript
Journal of Circuits, Systems and Computers
by UNIVERSITY OF EXETER LIBRARY on 07/17/20. Re-use and distribution is strictly not permitted, except for Open Access articles.
DOI: 10.1142/S0218126621500808
To be cited as: Burhan Khurshid, Javeed Jeelani Khan, An Efficient Fixed-Point Multiplier
based on CORDIC Algorithm, Journal of Circuits, Systems and Computers,
doi: 10.1142/S0218126621500808
This is an unedited version of the accepted manuscript scheduled for publication. It has been uploaded
in advance for the benefit of our customers. The manuscript will be copyedited, typeset and proofread
before it is released in the final form. As a result, the published copy may differ from the unedited
version. Readers should obtain the final version from the above link when it is published. The authors
are responsible for the content of this Accepted Article.
Accepted manuscript to appear in JCSC
Journal of Circuits, Systems, and Computers
World Scientific Publishing Company
Burhan Khurshid†
Department of ECE, IUST,
Awantipora (J&K), India
burhan32.iust@gmail.com
Fixed-point multiplication is an important operation that is frequently used in many digital signal
processing (DSP) applications. The operation is computationally intensive, and very often the
performance of the multiplier determines the overall performance of the DSP system. Consequently,
a wide range of approaches have been proposed for efficient implementation of fixed-point
multipliers on different hardware platforms. In this paper, we use the COordinate Rotation DIgital
Computer (CORDIC) algorithm to perform fixed-point multiplication. The motivation for our approach
is that CORDIC is a hardware-efficient algorithm in which accuracy can be traded off for
performance. Our implementation targets field programmable gate arrays (FPGAs) and focuses on
exploiting the underlying general and specialized fabric to the fullest. Performance comparisons
against various traditional and recent approaches show that a substantial improvement is achievable
by using CORDIC based multipliers. We have also implemented a recently proposed convolution
architecture using CORDIC based multipliers. The results show that a proper choice of CORDIC
architecture can improve performance parameters such as resource utilization, throughput and
dynamic power. This, however, is achieved at a small cost in accuracy. Our analysis of an 8-stage
CORDIC multiplier reports a mean absolute percentage error (MAPE) of 6.032, a figure that reduces
exponentially with an increasing number of stages.
Keywords: CORDIC; FPGA; Carry4 primitives; Fixed-point Arithmetic; LUT.
1. Introduction
Fixed-point multipliers are the elements of choice when design and implementation of high
†
Corresponding author.
critical path delays. Performance versus accuracy trade-offs often compel designers to use
fixed-point multipliers in their realizations [4-5]. Thus their performance is crucial and will
affect the overall performance of the top-level design they are part of [6-7].
Efficient implementation of complex arithmetic operations has always been a challenge
for designers. Traditionally, two approaches have been used: the look-up table (LUT) method
and polynomial expansion. While the former demands large LUTs for higher precision, the
latter suffers from slow convergence. COordinate Rotation DIgital Computer (CORDIC)
represents a compromise between these two methods, where the desired precision is achieved
using relatively fewer LUTs. This flexibility has enabled CORDIC to encapsulate a diversity
of arithmetic functions using a single basic set of recursive equations [8-9]. Some
applications include computation of the Discrete Cosine Transform (DCT) [10], Fast Fourier
Transform (FFT) [11-12], Recursive Least Square (RLS) filtering [13], Singular Value
Decomposition (SVD) [14] etc.
J CIRCUIT SYST COMP Downloaded from www.worldscientific.com
FPGAs are often used as implementation platforms to perform high speed tasks that
cannot be achieved using conventional processors. With distinctive advantages like lower
Non-Recurring Engineering (NRE) costs, a reconfigurable design approach, high integration
levels, post-production design verification [15] etc., FPGAs are fast moving from prototype
designing to low and medium volume production [16-17]. The architectural organization
of FPGAs enables realizations that are distributed spatially. This enables to capture a huge
modification to these traditional approaches. In [21] the authors propose a low power array
multiplier using multiplexer based full-adder cells. A related approach is reported in [22],
wherein the full-adder cell is modified to achieve reduced delays. Booth and Wallace
multipliers have also been modified for different performance parameters. In [23] the
authors propose a new realization of the Booth multiplier using a ring oscillator. The
proposed architecture is bit-serial in nature with low power dissipation. A reduction in
using carry-select adder and binary to excess-1 converter is reported in [26] and [27]. High
speed counter based Wallace multipliers have also been reported in [28]. Another approach
that has gained a lot of prominence in recent times is based on the usage of Vedic arithmetic
[29]. Vedic multipliers offer a high degree of parallelism that can be exploited to implement
sometimes fail to achieve the requisite performance demanded by some DSP applications.
CORDIC based computations can result in multiplier implementations wherein accuracy
can be traded off for performance. The CORDIC algorithm is frequently used for computing
complex, trigonometric and hyperbolic functions; however, it has very rarely been used for
linear operations like multiplication. In this paper, we experimentally show that
multipliers based on CORDIC can outperform the above-mentioned approaches in terms of
different performance parameters. We have also used CORDIC based multipliers for
computation of 2D convolution and compared our results with some recent work [33].
Additionally, an error analysis has been carried out that clarifies the performance versus
accuracy trade-offs associated with CORDIC multipliers.
The rest of the paper is organized as follows: Section 2 briefly discusses the CORDIC
algorithm. Section 3 discusses the CORDIC based multiplication process and the resulting
architectures. Synthesis, implementation and analysis are carried out in Section 4.
Conclusions are drawn in Section 5, which also discusses the future scope of the work.
References are listed at the end.
2. CORDIC Algorithm
Since its introduction in 1959 by Volder [34], the basic CORDIC algorithm has been
expanded and modified to encapsulate a wide range of arithmetic functions into a single
basic set of equations. Depending on the function to be evaluated, the algorithm can be
defined under linear, circular and hyperbolic coordinates [35]. There are two basic modes
of operation: the rotation mode and the vectoring mode. The rotation mode operates by
first specifying a desired rotation angle. Every iteration aims to diminish the magnitude
of the residual angle. Ideally, the angle value should reduce to zero. However, a trade-off
between the number of iterations and the accuracy will define the final magnitude of the
residual angle. The vectoring mode aims at aligning the resultant vector along the
horizontal axis. This is achieved by rotating the input vector through whatever angle is
necessary to align it along the horizontal axis. Again, a trade-off between the number of
iterations and accuracy will define the angular
$$\Theta_{k+1} = \begin{cases}
\Theta_k - \sigma_k \tan^{-1}(2^{-k}) & \text{if } n = 1 \\
\Theta_k - \sigma_k \tanh^{-1}(2^{-k}) & \text{if } n = -1 \\
\Theta_k - \sigma_k \, 2^{-k} & \text{if } n = 0
\end{cases} \qquad (3)$$
Where,
$$n = \begin{cases}
+1 & \text{for Circular Coordinates} \\
0 & \text{for Linear Coordinates} \\
-1 & \text{for Hyperbolic Coordinates}
\end{cases}$$
$$\Theta_s = -\theta_{N-1} + \sum_{i=1}^{N-1} \theta_{N-1-i} \cdot 2^{-i} \qquad (14)$$
The product B_m is a truncated product, i.e. the N-1 lower order bits are discarded. Therefore,
$$B_m = b_{N-1}\,.\,b_{N-2}\,b_{N-3} \ldots b_1\,b_0 \qquad (15)$$
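The fixed-point format used for the operand and the truncated product can be illustrated with a short Python sketch. It assumes that each word is a two's-complement fraction with the binary point placed after the sign bit, which is our reading of the notation above; the function names `twos_complement_fraction` and `truncate_product` are hypothetical and do not come from the paper.

```python
def twos_complement_fraction(bits):
    """Value of an N-bit two's-complement fraction given MSB first.

    bits[0] is the sign bit with weight -1; bits[i] (i >= 1) has weight 2**-i.
    This layout is an assumption made for illustration.
    """
    sign, rest = bits[0], bits[1:]
    return -sign + sum(b * 2.0 ** -(i + 1) for i, b in enumerate(rest))


def truncate_product(product, n):
    """Discard the N-1 low-order bits of an integer product.

    Two operands with N-1 fractional bits each give a product with
    2*(N-1) fractional bits; dropping N-1 of them restores the format.
    """
    return product >> (n - 1)


# 0.110 in binary is 0.75; 1.100 is -1 + 0.5 = -0.5
print(twos_complement_fraction([0, 1, 1, 0]))
print(twos_complement_fraction([1, 1, 0, 0]))
# 0.75 * 0.5 with N = 4: operands 6 and 4 (scale 8), product 24 (scale 64)
print(truncate_product(6 * 4, 4) / 8)
```

With N = 4, the last line recovers 0.375 in the original 3-fractional-bit scale, which shows why discarding N-1 bits keeps the product in the same format as the operands.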
The CORDIC multiplier equations given in (4), (5) and (6) are iterative in nature. A direct
mapping of these equations onto hardware results in a word-serial architecture. Fig. 1
shows the top-level schematic of a word-serial CORDIC multiplier. Such a realization
requires fewer on-chip resources but severely limits the throughput of the multiplier.
The iterative multiplier equations can be easily unfolded into a multi-stage realization,
wherein the individual iterations are represented by separate stages. Such a realization has
two advantages. First, unlike the serial architecture, where the shifters need to be updated
after every iteration, the shifters are fixed in the unfolded realization. These shifters can
be easily implemented in FPGA wiring, thereby reducing the resources utilized. Second,
unfolded architectures can be easily pipelined by placing registers along the feed-forward
paths. This improves the throughput of the multiplier. Fig. 2 shows the unfolded CORDIC
multiplier. The dotted lines represent the pipeline stages.
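The shift-and-add iteration described above can be sketched in software as follows. This is a minimal behavioural model and not the authors' hardware description. It assumes the standard linear-mode CORDIC recurrences (B is updated by adding or subtracting A shifted right by k, the residual Θ is updated by the constant 2^-k, and the direction σ is taken from the sign of Θ), an iteration index that starts at k = 1, and operands scaled by 2^frac_bits. The function name and its parameters are hypothetical.

```python
def cordic_multiply_fixed(a, theta, frac_bits=15, stages=8):
    """Approximate a * theta using only shifts and additions.

    a and theta are signed integers scaled by 2**frac_bits, and the
    magnitude of theta must not exceed 1.0 in that scale. The result is
    returned in the same scale. All of this is an illustrative assumption.
    """
    one = 1 << frac_bits
    b = 0
    for k in range(1, stages + 1):
        sigma = 1 if theta >= 0 else -1   # direction from the sign of the residual
        b += sigma * (a >> k)             # add/subtract the shifted multiplicand
        theta -= sigma * (one >> k)       # drive the residual towards zero
    return b


a = int(0.5 * 2 ** 15)        # 0.5 in Q15
theta = int(0.3 * 2 ** 15)    # about 0.3 in Q15
approx = cordic_multiply_fixed(a, theta, stages=8)
exact = (a * theta) >> 15
print(approx, exact, abs(approx - exact))
```

Each pass of the loop corresponds to one clock cycle of the word-serial realization or to one stage of the unfolded realization, where the shift amount k becomes fixed wiring. Under the stated assumptions the residual after m stages is at most 2^-m, so the error it contributes is bounded by |A|·2^-m, plus a small truncation error from the right shifts.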
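[Fig. 1: schematic of the word-serial CORDIC multiplier. Recoverable labels: registers (D) for A_{K+1}, B_{K+1} and Θ_{K+1}; SHIFTER >> k; add/subtract (+/-) units; constant 2^-k.]
[Fig. 2: schematic of the unfolded CORDIC multiplier. Recoverable labels: inputs A_s, B_s and Θ_s; fixed shifts >>1, >>2, >>3, >>4, ..., >>m-1, >>m; direction signals σ_0, σ_1, σ_2, σ_3, ..., σ_{m-2}, σ_{m-1}; add/subtract (+/-) units; output B_m.]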
considered FPGAs that have 6-input LUTs as their basic logic element. Specifically,
Virtex-5 FPGAs from Xilinx have been considered for implementation. Apart from the
general LUT fabric, these devices also include specialized Carry4 primitives that speed up
the carry propagation encountered in many arithmetic operations. The Carry4 primitive is
basically a fast 4-bit carry chain based on look-ahead carry logic. The primitive allows
FPGAs to efficiently implement arithmetic operations that involve propagation of carry
within a logic cell.
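The role of a 4-bit carry chain in an adder can be illustrated with a small behavioural sketch. It is not a model of the Xilinx primitive's netlist: it only evaluates the textbook propagate/generate recurrence that carry logic of this kind implements, and the function name `carry_chain_4bit` is hypothetical.

```python
def carry_chain_4bit(a, b, carry_in=0):
    """Add two 4-bit values using per-bit propagate and generate signals.

    Returns (sum, carry_out). Behavioural illustration only; the real
    primitive computes the same carries in dedicated fast logic.
    """
    total, carry = 0, carry_in
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        propagate = ai ^ bi                      # carry passes through this bit
        generate = ai & bi                       # carry is created at this bit
        total |= (propagate ^ carry) << i        # sum bit
        carry = generate | (propagate & carry)   # carry into the next bit
    return total, carry


print(carry_chain_4bit(9, 8))      # 9 + 8 = 17, i.e. sum 1 with carry out
print(carry_chain_4bit(5, 6))      # 5 + 6 = 11, no carry out
```

Wider adders, such as those in each CORDIC stage, would be formed by cascading such 4-bit blocks, with the carry output of one block feeding the carry input of the next.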
Detailed implementation has been carried out by realizing serial and unfolded
architectures of the CORDIC based multiplier for varying operand word-lengths. The unfolded
CORDIC multiplier is designed in eight stages, first using only conventional LUT fabric
and then using a combination of LUTs and Carry4 primitives. Both combinational and
pipelined realizations have been considered. Performance has been specified in terms of
three parameters, viz. resources utilized, timing and power dissipation. Resources include
the number of LUTs, flip-flops and logic slices used. Timing gives the notion of speed. For
combinational realizations, timing analysis mainly focuses on analyzing the paths from
input to output. The result is usually quoted as a single metric that corresponds to the
combinational delay of the critical path. For pipelined realizations, timing analysis is
concerned with the maximum operating frequency of the structure. Both critical path delay
and maximum operating frequency are used to obtain the throughput of a particular
structure. To give a realistic picture of the performance, post Place and Route (PAR) timing
analysis under constrained environments has been done. Post PAR timing analysis also
enables designers to capture a realistic picture of the switching activity that occurs
along different nodes in a routed design. The same is used to assess the dynamic power
dissipation of an implemented design.
We have compared our multiplier realizations against various traditional approaches
reported in [18-20]. These include the Basic Array Multiplier (BAM), Carry Save
Multiplier (CSM), Carry Ripple Multiplier (CRM), Wallace Tree Multiplier (WTM),
Vedic Multiplier (VM) and three different types of Signed Booth Multiplier (BSM-I,
BSM-II, BSM-III). Additionally, some recent multiplier realizations have also been considered.
These include the multiplexer based array multiplier (MUX-Array) [21], Ring Oscillator
based Booth Multiplier (RO-Booth) [23], Probabilistic Booth Multiplier (Prob.-Booth)
[25], Square Root Carry Select Adder based Wallace Multiplier (SRCSA-Wallace) [26]
and a High-Speed Vedic Multiplier (HS-Vedic) [31]. We have also implemented the
multipliers have lower resource utilization when compared to other realizations. Owing to
its iterative nature, the CORDIC serial multiplier (CO-Ser.) shows the minimum LUT and
slice count, as the resources are shared among multiple iterations. The unfolded (CO-Unf.)
and pipelined (CO-Pip.) architectures have a higher resource count than the serial
architecture. The resource count can be reduced by using Carry4 primitives in the synthesis
process. The resulting architectures (CO-Unf.-Cry4. and CO-Pip.-Cry4.) have a reduced
LUT and slice count. Further analysis is carried out by comparing the CORDIC based
multipliers against some recent multiplier realizations for varying operand word-lengths.
The results are shown in fig. 3 and fig. 4.
Fig. 5 shows the LUT utilization of the convolution architecture proposed in [33] using
the Xilinx IP core multiplier (XIP-Core) and CORDIC based multipliers as the kernel size is
varied. Again, the serial CORDIC based convolution architecture utilizes the fewest LUT
resources. It should be noted that the Xilinx IP core multiplier can be realized using LUTs
or DSP blocks. To ensure a fair comparison, we have used the LUT based Xilinx IP core
multiplier in our analysis. Further, the IP core multiplier is implemented with an optimal
level of pipelining. This adds to the number of flip-flops used by the convolution
architecture. Consequently, while the serial CORDIC based architecture shows the minimum
flip-flop utilization, the unfolded CORDIC based architectures also have a lower flip-flop
count than the Xilinx IP core based architecture. This is shown in fig. 6. Fig. 7 shows the
variation in slice count for different realizations as the kernel size is varied. The variation
trend is similar to that of fig. 5.
CO-Pip. [this work] 270 332 98
[Plot: resource utilization, No. of LUTs versus word-length; series: MUX-Array, RO-Booth, Prob.-Booth, SRCSA-Wallace, HS-Vedic, CO-Ser., CO-Unf., CO-Pip., CO-Unf.-Cry4., CO-Pip.-Cry4.]
Fig. 3 LUT utilization for different multiplier realizations.
[Plot: resource utilization, No. of Slices versus word-length; same series as Fig. 3.]
Fig. 4 Logic Slice utilization for different multiplier realizations.
[Plot: resource utilization, No. of LUTs versus filter size (3×3, 5×5, 7×7, 9×9); series: XIP-Core, CO-Ser, CO-Unf., CO-Pip., CO-Unf.-Cry4., CO-Pip.-Cry4.]
[Plot: resource utilization, No. of Flip-Flops versus filter size; same series.]
Fig. 6 Flip-Flop utilization for different convolution architecture realizations.
[Plot: No. of Slices versus filter size (3×3, 5×5, 7×7, 9×9, 11×11); recoverable series: CO-Pip., CO-Unf.-Cry4., CO-Pip.-Cry4.]
throughput is simply the inverse of the delay associated with the critical path. For
synchronous sequential circuits, throughput is determined by the maximum frequency at
which the circuit can be clocked. Table 2 lists the timing metrics of the different multiplier
architectures for an operand word-length of 16 bits. It is observed that CORDIC based
multipliers have comparatively smaller critical path delays. The critical path delays can be
further reduced by pipelining the unfolded architectures. Pipelining also results in an
interleaved operation, thereby increasing the throughput of the multiplier. The highest
clock frequency is achieved with serial architectures, namely the ring oscillator based
Booth multiplier and the CORDIC serial multiplier. However, owing to their serial nature, the
resulting throughput is much lower. It should, however, be noted that the CORDIC serial
multiplier is a word-serial architecture whose throughput is limited by the number of
iterations. Thus, an increase in the operand word-length has no impact on the overall
throughput of the multiplier. This is shown in fig. 8, where the throughput of CORDIC
based multipliers and some recently proposed multipliers is analyzed for varying operand
word-lengths. The CORDIC serial multiplier exhibits a flat throughput response. This is
advantageous for the large operand word-length multipliers used in some DSP
applications. Table 3 lists the throughput of different multipliers for an operand
word-length of 128 bits. The CORDIC serial multiplier achieves the maximum throughput.
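The two throughput definitions used in this discussion can be written down as a short sketch. The helper names are hypothetical. The serial case assumes that one result is produced every fixed number of clock cycles, which is our reading of "limited by the number of iterations" rather than a formula stated in the text, and the 400 MHz and 8-cycle figures are made-up inputs.

```python
def combinational_throughput_mhz(critical_path_ns):
    """Throughput of a combinational design: inverse of the critical path delay."""
    return 1000.0 / critical_path_ns


def serial_throughput_mhz(max_clock_mhz, cycles_per_result):
    """Throughput of a serial design, assuming one result every
    cycles_per_result clock cycles (an assumption for illustration)."""
    return max_clock_mhz / cycles_per_result


# A 29.83 ns critical path, as listed for CRM, gives roughly 33.5 MHz.
print(round(combinational_throughput_mhz(29.83), 2))
# Hypothetical serial design: 400 MHz clock, 8 iterations per product.
print(serial_throughput_mhz(400.0, 8))
```

The second helper does not depend on the operand word-length when the iteration count is fixed, which is consistent with the flat throughput response reported for the CORDIC serial multiplier.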
Multiplier Design    Critical Path (ns)    Max. Clock Freq. (MHz)    Throughput (MHz)
CRM [18-20]          29.83                 --                        33.524
BSM-III [18-20]      16.01                 --                        62.46
MUX-Array [21]       22.78                 --                        43.9
RO-Booth [23]        4.224                 521.98                    32.63
Prob.-Booth [25]     15.554                --                        64.3
[Plot: throughput, frequency (MHz) versus word-length; series: MUX-Array, RO-Booth, Prob.-Booth, SRCSA-Wallace, HS-Vedic, CO-Ser., CO-Unf., CO-Pip., CO-Unf.-Cry4., CO-Pip.-Cry4.]
Fig. 8 Throughput variation for different multiplier realizations
Table 3. Throughput comparison of different multipliers for an operand word-length of 128 bits
Prob.-Booth [25] 132.72 7.53
CO-Pip. [this work]
117.46
23.42
30.93
8.5
42.7
Fig. 9 shows the throughput variation of the convolution architecture based on the Xilinx IP
core multiplier and different CORDIC based multipliers as the kernel size is varied. It is
observed that the convolution architecture based on the pipelined Carry4 CORDIC multiplier
has the highest throughput. As mentioned previously, the Xilinx IP core multiplier is
implemented with an optimal level of pipelining. Thus, any latency concerns that apply to
the pipelined CORDIC multiplier based architecture also apply to the Xilinx IP based
architecture. However, for a real-time application, throughput is a more important
parameter than latency. Another interesting analysis concerns the critical path of the
different architectures. Fig. 10 shows the critical path variation as a function of kernel size.
The Xilinx IP core based architecture has a comparatively longer critical path than the serial
and pipelined CORDIC based architectures. This is one of the drawbacks of using hard IP
cores. These cores are always fixed in the FPGA fabric, thereby increasing the cost of
routing data to and from them. While this may not affect the throughput of the system,
it increases the physical capacitance associated with these routes, resulting in greater power
dissipation as discussed in the next section.
[Plot: throughput, frequency (MHz) versus filter size (3×3, 5×5, 7×7, 9×9, 11×11); series: XIP-Core, CO-Ser, CO-Unf., CO-Pip., CO-Unf.-Cry4., CO-Pip.-Cry4.]
[Plot: critical path, delay (ns) versus filter size (3×3, 5×5, 7×7, 9×9, 11×11); same series.]
Fig. 10 Critical path variation for different convolution architecture realizations
For power analysis, all the multipliers are implemented using synchronous design practices.
Such an approach involves the use of registers at the input and the output of the multiplier.
Power dissipation is a strong function of the clock frequency. It is also directly related
to the physical capacitances along different nodes and routes within a design. To ensure
a fair comparison, power analysis is done at a nominal clock frequency of 100 MHz for
all the multipliers. Table 4 lists the dynamic power dissipation of different multiplier
realizations for an operand word-length of 16 bits. Serial multiplier architectures
(RO-Booth and CO-Ser.) show the least power dissipation as they utilize the fewest
underlying resources, thereby reducing the logic power dissipation. Additionally, these
multipliers also have shorter critical paths and thus smaller route capacitances. This further
reduces the dynamic power dissipation. Unfolded CORDIC multipliers also have lower
power dissipation because of the reduced usage of logic resources. Pipelining further
reduces the power dissipation by breaking the critical paths, resulting in reduced
capacitances associated with these routes. Further analysis plots the dynamic power
dissipation as a function of operand word-length. The results are shown in fig. 11. Similar
trends are observed in convolution architectures based on different multipliers.
Architectures based on serial and pipelined multiplier realizations (XIP-Core, CO-Ser.,
CO-Pip., CO-Pip.-Cry4.) have comparatively lower power dissipation. The CORDIC serial
multiplier based architecture has the least power dissipation as it utilizes fewer underlying
logic resources. The results are shown in fig. 12.
VM [18-20] 37.82
[Plot: dynamic power dissipation, power (mW) versus word-length; series: MUX-Array, RO-Booth, Prob.-Booth, SRCSA-Wallace, HS-Vedic, CO-Ser., CO-Unf., CO-Pip., CO-Unf.-Cry4., CO-Pip.-Cry4.]
[Plot: dynamic power dissipation, power (mW) versus filter size (3×3, 5×5, 7×7, 9×9, 11×11); series: XIP-Core, CO-Ser, CO-Unf., CO-Pip., CO-Unf.-Cry4., CO-Pip.-Cry4.]
Fig. 12 Dynamic Power dissipation for different convolution architecture realizations
performance will, therefore, determine the proper choice of CORDIC multiplier for a
particular application. Fig. 14 shows the increase in LUT resources required to reduce the
MAPE of an unfolded Carry4 based CORDIC multiplier with an operand word-length of 16 bits.
A similar analysis plots the reduction in throughput that accompanies a reduced MAPE for a
serial CORDIC multiplier. The results are shown in fig. 15.
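The dependence of MAPE on the number of stages can be explored with a small floating-point experiment. This is a sketch under stated assumptions and is not expected to reproduce the figures reported in the paper: the multiplier is the standard linear-mode CORDIC recurrence, operands are drawn uniformly from [0.1, 1.0] to keep the percentage error well defined, and the names `cordic_multiply` and `mape` are hypothetical.

```python
import random


def cordic_multiply(a, theta, stages):
    """Linear-mode CORDIC approximation of a * theta for |theta| <= 1."""
    b = 0.0
    for k in range(1, stages + 1):
        sigma = 1.0 if theta >= 0 else -1.0
        b += sigma * a * 2.0 ** -k
        theta -= sigma * 2.0 ** -k
    return b


def mape(stages, samples=2000, seed=0):
    """Mean absolute percentage error over random operand pairs.

    The operand distribution is an assumption made for illustration.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        a = rng.uniform(0.1, 1.0)
        theta = rng.uniform(0.1, 1.0)
        exact = a * theta
        total += abs((cordic_multiply(a, theta, stages) - exact) / exact)
    return 100.0 * total / samples


for m in (8, 12, 16):
    print(m, mape(m))
```

Because the residual is at most 2^-m after m stages, each additional stage roughly halves the worst-case error, so the measured MAPE falls as stages are added, at the price of more LUTs in an unfolded design or more clock cycles in a serial one.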
[Plot: MAPE versus No. of Stages/Iterations (7 to 17).]
[Plot: No. of LUTs versus MAPE.]
[Plot: Throughput (MHz) versus MAPE.]
between performance and accuracy will determine the proper choice of CORDIC
architecture for a particular application. Our analysis also reveals that the performance
speed-up obtained with CORDIC based multipliers carries over to more complex
operations like convolution. Our future endeavors will focus on using the CORDIC algorithm
to perform the multiply-accumulate operation. The same can be used as a building block to
realize efficient high-performance filter structures.
Acknowledgments
This work was carried out under the seed grant initiative of TEQIP-III project. The authors
are grateful to the TEQIP-III project team of IUST for their assistance and financial support
during the entire course of study.
References
1. G. L. Narayan and B. Venkataramani, Optimization Techniques for FPGA based Wave
Pipelined DSP Blocks, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 13 (2005), 783-792.
2. M. A. Ashour and H. I. Saleh, An FPGA Implementation guide for some different types of
Serial-Parallel Multiplier Structures, Microelectronics Journal, 31 (2000), 161-168.
3. K. Compton, S. Hauck, Reconfigurable Computing: A survey of Systems and Software, ACM
Computing Surveys, 34 (2002), 171-210.
4. IEEE Standard for Binary Floating-Point Arithmetic, IEEE Standards Board, (2018).
5. Technical Report ANSI/IEEE Std. 754-1985, the Institute of Electrical and Electronics
Engineers (1985).
6. C. Inacio, D. Ombres, The DSP decision: Fixed point or floating? IEEE Spectrum, 33 (1996),
72-74.
7. R. Tessier and W. Burleson, Reconfigurable Computing for DSP: A Survey, Journal of VLSI
Signal Processing, 28 (2001), 7-27.
8. S. Hauck and A. Dehon, Reconfigurable Computing: The Theory and Practice of FPGA-based
12. A. M. Despain, Fourier Transform Computers using CORDIC Iterations, IEEE Transactions on
Computers, 23, (1974), 993-1001.
13. B. Haller, J. Gotze, J. Cavallaro, Efficient Implementation of Rotation Operations for high-
performance QRD-RLS filtering, Proceedings of the International Conference on Application
Specific Systems, Architectures and Processors, (1997).
14. J. R. Cavallaro, F. T. Luk, CORDIC Arithmetic for an SVD Processor, Journal of Parallel and
Distributed Computing, (1988), 271-290.
17.
compressor cells, Journal of VLSI Signal Processing, 31 (2002), 77-89.
18. S. Bhattacharjee, S. Sil, B. Basak and A. Chakarbarti, Evaluation of Power Efficient Adder and
Multiplier Circuits for FPGA based DSP Applications, Proceedings of the International
Conference on Communication and Industrial Applications (ICCIA), (2011)
19. B. Khurshid and R. Naaz, Technology Optimized Fixed-Point Bit-Parallel Multiplier for LUT
based FPGAs, International Journal of High Performance Systems Architecture, 6 (2016), 28-
35.
J CIRCUIT SYST COMP Downloaded from www.worldscientific.com
Multipliers: Advancement and Comparison, IIOAB Journal, 09 (2018), 54-66.
21. S. Srikanth, I. T. Banu, G. V. Priya and G. Usha, Low power array multiplier using modified
full adder, IEEE International Conference on Engineering and Technology (ICETECH),
Coimbatore, (2016), 1041-1044.
22. S. K. Sahoo and C. Shekhar, Delay optimized array multiplier for signal and image processing,
International Conference on Image Information Processing, Shimla, (2011), 1-4.
23. D. Okamoto, M. Kondo, T. Yokogawa, Y. Sejima, K. Arimoto and Y. Sato, A Serial Booth
Multiplier Using Ring Oscillator, Fourth International Symposium on Computing and
Networking (CANDAR), Hiroshima, (2016), 458-461.
DM
24. R. Shrestha and U. Rastogi, Design and Implementation of Area-Efficient and Low-Power
Configurable Booth-Multiplier, 29th International Conference on VLSI Design and 2016 15th
International Conference on Embedded Systems (VLSID), Kolkata, (2016), 599-600.
25. M. V. Durga Pavan and S. R. Ramesh, An Efficient Booth Multiplier Using Probabilistic
Approach, International Conference on Communication and Signal Processing (ICCSP),
Chennai, (2018), 365-368.
26. R. B. S. Kesava, B. L. Rao, K. B. Sindhuri and N. U. Kumar, Low Power and Area Efficient
Wallace Tree Multiplier using Carry Select Adder with Binary to Excess-1 Converter,
Conference on Advances in Signal Processing (CASP), Pune, (2016), 248-253.
27. D. Paradhasaradhi, M. Prashanthi and N. Vivek, Modified Wallace Tree Multiplier using
Efficient Square Root Carry Select Adder, International Conference on Green Computing