
Accepted manuscript to appear in JCSC
Accepted Manuscript
Journal of Circuits, Systems and Computers
Article Title: An Efficient Fixed-Point Multiplier based on CORDIC Algorithm
Author(s): Burhan Khurshid, Javeed Jeelani Khan
DOI: 10.1142/S0218126621500808
Received: 26 September 2019

Accepted: 14 July 2020
To be cited as: Burhan Khurshid, Javeed Jeelani Khan, An Efficient Fixed-Point Multiplier
based on CORDIC Algorithm, Journal of Circuits, Systems and Computers,
doi: 10.1142/S0218126621500808
Link to final version: https://doi.org/10.1142/S0218126621500808
This is an unedited version of the accepted manuscript scheduled for publication. It has been uploaded
in advance for the benefit of our customers. The manuscript will be copyedited, typeset and proofread
before it is released in the final form. As a result, the published copy may differ from the unedited
version. Readers should obtain the final version from the above link when it is published. The authors
are responsible for the content of this Accepted Article.
Journal of Circuits, Systems, and Computers
© World Scientific Publishing Company
An Efficient Fixed-Point Multiplier based on CORDIC Algorithm *
Burhan Khurshid†
Department of ECE, IUST,
Awantipora (J&K), India
burhan32.iust@gmail.com
Javeed Jeelani Khan

Department of ECE, IUST,
Awantipora (J&K), India

javeedjeelanikhan@gmail.com
Fixed-point multiplication is an important operation that is frequently used in many digital signal
processing (DSP) applications. The operation is computationally intensive, and very often the
performance of the multiplier determines the overall performance of the DSP system. Consequently, a
wide range of approaches have been proposed for efficient implementation of fixed-point multipliers
on different hardware platforms. In this paper, we use the COordinate Rotation DIgital Computer
(CORDIC) algorithm to perform fixed-point multiplication. The motivation for our approach is that
CORDIC is a hardware-efficient algorithm in which accuracy can be traded off for performance. Our
implementation targets field programmable gate arrays (FPGAs) and focuses on exploiting the
underlying general and specialized fabric to the fullest. Performance comparisons against various
traditional and recent approaches show that a substantial improvement is achievable by using CORDIC
based multipliers. We have also implemented a recently proposed convolution architecture using
CORDIC based multipliers. The results show that a proper choice of CORDIC architecture can improve
performance parameters like resource utilization, throughput and dynamic power. This, however, is
achieved at a small cost in accuracy. Our analysis of an 8-stage CORDIC multiplier reports a mean
absolute percentage error (MAPE) of 6.032, a figure that reduces exponentially with an increasing
number of stages.
Keywords: CORDIC; FPGA; Carry4 primitives; Fixed-point Arithmetic; LUT.
1. Introduction
Fixed-point multipliers are the elements of choice when the design and implementation of
high-performance DSP hardware is concerned [1-3]. Fixed-point representation may be
considered a degenerate case of floating-point representation, in the sense that the
exponent is fixed and does not vary with time. Such a representation is apt for FPGAs as it
reduces the number of logic levels, thereby limiting the end-to-end
† Corresponding author.
critical path delays. Performance versus accuracy trade-offs often compel designers to use
fixed-point multipliers in their realizations [4-5]. Thus their performance is crucial and will
affect the overall performance of the top-level design they are part of [6-7].
Efficient implementation of complex arithmetic operations has always been a challenge
for designers. Traditionally two approaches have been used: the look-up table (LUT) method
and polynomial expansion. While the former demands large LUTs for higher precision, the
latter suffers from slow convergence. COordinate Rotation DIgital Computer (CORDIC)
represents a compromise between these two methods, where the desired precision is achieved
using relatively fewer LUTs. This flexibility has enabled CORDIC to encapsulate a diversity
of arithmetic functions using a single basic set of recursive equations [8-9]. Some
applications include computation of the Discrete Cosine Transform (DCT) [10], Fast Fourier
Transform (FFT) [11-12], Recursive Least Squares (RLS) filtering [13], Singular Value
Decomposition (SVD) [14], etc.
FPGAs are often used as implementation platforms to perform high-speed tasks that
cannot be achieved using conventional processors. With distinctive advantages like lower
Non-Recurring Engineering (NRE) costs, a reconfigurable design approach, high integration
levels and post-production design verification [15], FPGAs are fast moving from prototype
designing to low and medium volume production [16-17]. The architectural organization
of FPGAs enables realizations that are distributed spatially. This makes it possible to
capture the large amount of parallelism that is available in many hardware-efficient
algorithms like CORDIC. Thus FPGAs serve as the platforms of choice for implementing such
algorithms. Another important feature of modern FPGAs is the introduction of specialized
primitives and macroblocks. Including these special blocks in the design process
considerably improves the performance of the top-level design.
A large number of traditional approaches towards multiplication have been extended
to the implementation of fixed-point multipliers. Prominent among these are Array
multipliers, Carry ripple multipliers, Carry save multipliers, Booth multipliers, Wallace
tree multipliers and Vedic multipliers. A detailed discussion of these multipliers can be
found in [18-20]. Owing to the increased demand for performance, these traditional
approaches have been modified extensively. In [21] the authors propose a low-power array
multiplier using multiplexer based full-adder cells. A related approach is reported in [22],
wherein the full-adder cell is modified to achieve reduced delays. Booth and Wallace
multipliers have also been modified for different performance parameters. In [23] the
authors present a new realization of the Booth multiplier using a ring oscillator. The
proposed architecture is bit-serial in nature with low power dissipation. A reduction in
power dissipation, accompanied by a reduction in resource utilization, through clock gating
and resource sharing in a Booth multiplier is reported in [24]. While the approaches in [23]
and [24] are exact, a probabilistic approach towards Booth multiplication is reported in
[25]. The authors replace the conventional truncation circuitry with a probabilistic
estimation bias circuit and claim to achieve better resource utilization and reduced power
dissipation. Similarly, low-power and area-efficient realizations of Wallace tree multipliers
using a carry-select adder and a binary to excess-1 converter are reported in [26] and [27].
High-speed counter based Wallace multipliers have also been reported in [28]. Another approach
that has gained prominence in recent times is based on Vedic arithmetic [29]. Vedic
multipliers offer a high degree of parallelism that can be exploited to implement
fast parallel multipliers. Some noteworthy contributions are reported in [30-32].

While all the above-mentioned approaches have a high degree of accuracy, they
sometimes fail to achieve the performance demanded by some DSP applications.
CORDIC based computations can result in multiplier implementations wherein accuracy
can be traded off for performance. The CORDIC algorithm is frequently used for computing
complex, trigonometric and hyperbolic functions; however, it has very rarely been used
for linear operations like multiplication. In this paper, we experimentally show that
multipliers based on CORDIC can outperform the above-mentioned approaches in terms of
different performance parameters. We have also used CORDIC based multipliers for
computation of 2D convolution and compared our results with some recent work [33].
Additionally, an error analysis has been carried out that gives a clear understanding of
the performance versus accuracy trade-offs associated with CORDIC multipliers.
The rest of the paper is organized as follows: Section 2 briefly discusses the CORDIC
algorithm. Section 3 discusses the CORDIC based multiplication process and the resulting
architectures. Synthesis, implementation and analysis are carried out in Section 4.
Conclusions are drawn in Section 5, which also discusses the future scope of the work.
References are listed at the end.
2. CORDIC Algorithm
Since its introduction in 1959 by Volder [34], the basic CORDIC algorithm has been
expanded and modified to encapsulate a wide range of arithmetic functions into a single
basic set of equations. Depending on the exact nature of function to be evaluated, the
algorithm can be defined under linear, circular and hyperbolic coordinates [35]. As far as
operation is concerned, there are two basic modes: the rotation mode and the vectoring
mode. The rotation mode operates by first specifying a desired rotation angle. After every
iteration the aim is to diminish the magnitude of the residual angle. Ideally, the angle value
should reduce to zero. However, a trade-off between the number of iterations and the
accuracy will define the final magnitude of the residual angle. The vectoring mode aims at
aligning the resultant vector along the horizontal axis. This is achieved by rotating the input
vector through whatever angle is necessary to align it along the horizontal axis. Again, a
trade-off between the number of iterations and accuracy will define the angular
displacement between the final vector and the horizontal axis.

Both operating modes can be captured using a single set of unified equations:
$$A_{k+1} = A_k - n\,\sigma_k\,B_k\,2^{-k} \tag{1}$$
$$B_{k+1} = B_k + \sigma_k\,A_k\,2^{-k} \tag{2}$$
$$\Theta_{k+1} = \begin{cases}
\Theta_k - \sigma_k \tan^{-1}(2^{-k}) & \text{if } n = 1 \\
\Theta_k - \sigma_k \tanh^{-1}(2^{-k}) & \text{if } n = -1 \\
\Theta_k - \sigma_k\,2^{-k} & \text{if } n = 0
\end{cases} \tag{3}$$
Where,
$$n = \begin{cases}
+1 & \text{for circular coordinates} \\
0 & \text{for linear coordinates} \\
-1 & \text{for hyperbolic coordinates}
\end{cases}$$
A and B are coordinate vectors,
Θ is the angle vector,
σ determines the direction of rotation in the next iteration.
By operating the unified equations in different coordinate systems, different arithmetic
functions can be evaluated.
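The unified recurrence in equations (1) to (3) can be sketched in a few lines of Python. This is an illustrative floating-point model, not the hardware description used in this work: the function names, the rotation-mode rule σk = sign(Θk) and the choice to start at k = 0 are assumptions made for the example. The hyperbolic case additionally needs k ≥ 1 and repeated iterations to converge, which the sketch does not handle.

```python
import math

def elementary_angle(k, n):
    """Angle step of equation (3) for coordinate system n (+1, 0 or -1)."""
    step = 2.0 ** -k
    if n == 1:
        return math.atan(step)    # circular coordinates
    if n == -1:
        return math.atanh(step)   # hyperbolic coordinates (requires k >= 1)
    return step                   # linear coordinates

def cordic_rotation(a, b, theta, n, iterations):
    """Iterate equations (1)-(3) in rotation mode, driving theta towards zero."""
    for k in range(iterations):
        sigma = 1.0 if theta >= 0 else -1.0   # assumed rule: sign of residual angle
        # Both updates use the values from the previous iteration.
        a, b = (a - n * sigma * b * 2.0 ** -k,
                b + sigma * a * 2.0 ** -k)
        theta -= sigma * elementary_angle(k, n)
    return a, b, theta

def circular_gain(iterations):
    """Scale factor accumulated by the circular iterations (n = 1)."""
    gain = 1.0
    for k in range(iterations):
        gain *= math.sqrt(1.0 + 4.0 ** -k)
    return gain

# Circular coordinates: pre-scaling A by 1/gain yields cos and sin of the angle.
gain = circular_gain(16)
cos_v, sin_v, _ = cordic_rotation(1.0 / gain, 0.0, 0.5, n=1, iterations=16)

# Linear coordinates: with B = 0 the B output approaches A * theta.
_, product, residual = cordic_rotation(0.75, 0.0, 0.5, n=0, iterations=16)
```

Setting n = 0 removes the cross term in equation (1), which is why the same loop that computes sine and cosine also performs a multiplication.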

3. CORDIC based Multiplier
Section 2 discussed the unified CORDIC equations for the linear, circular and hyperbolic
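coordinate systems. For the linear coordinate system, n = 0 and the basic set of iterative
equations becomes:
$$A_{k+1} = A_k \tag{4}$$
$$B_{k+1} = B_k + \sigma_k\,A_k\,2^{-k} \tag{5}$$
$$\Theta_{k+1} = \Theta_k - \sigma_k\,2^{-k} \tag{6}$$
After m iterations, with the residual angle ideally driven to zero, the resultant equations would be:
$$A_m = A_s \tag{7}$$
$$B_m = B_s + A_s\,\Theta_s \tag{8}$$
$$\Theta_m = 0 \tag{9}$$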
If the B vector has an initial value Bs = 0, then after m iterations the B vector gives the
product of As and Θs, i.e. for Bs = 0:
$$B_m = A_s\,\Theta_s \tag{10}$$
Equation 10 can be used to perform the multiply operation of the vectors As and Θs. Since
CORDIC operates using simple shift and add operations, the mapping of equation 10 on
hardware can provide an efficient multiplier block. Note that the multiplication is fixed
point. Therefore, N-bit operands As and Θs may be represented as:
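$$A_s = a_{N-1}\,.\,a_{N-2}\,a_{N-3} \ldots a_1\,a_0 \tag{11}$$
$$\Theta_s = \theta_{N-1}\,.\,\theta_{N-2}\,\theta_{N-3} \ldots \theta_1\,\theta_0 \tag{12}$$
The value of these numbers lies in the range [-1, 1) and is given by:
$$A_s = -a_{N-1} + \sum_{i=1}^{N-1} a_{N-1-i}\,2^{-i} \tag{13}$$
$$\Theta_s = -\theta_{N-1} + \sum_{i=1}^{N-1} \theta_{N-1-i}\,2^{-i} \tag{14}$$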
The product Bm is a truncated product, i.e. the N-1 lower-order bits are discarded. Therefore,
$$B_m = b_{N-1}\,.\,b_{N-2}\,b_{N-3} \ldots b_1\,b_0 \tag{15}$$
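As a rough software illustration of equations (4) to (15), the sketch below models the linear CORDIC multiplier on two's-complement integers using only shifts and additions. The helper names, the default of eight iterations starting at k = 0 and the rule σk = sign(Θk) are assumptions for the example. Python integers are unbounded, so the model ignores the guard bit that a hardware datapath would need when the first iteration temporarily leaves the [-1, 1) range.

```python
def to_fixed(x, n_bits=16):
    """Encode x in [-1, 1) as an N-bit two's-complement integer (1 sign bit, N-1 fraction bits)."""
    scale = 1 << (n_bits - 1)
    value = int(round(x * scale))
    if not -scale <= value < scale:
        raise ValueError("operand outside [-1, 1)")
    return value

def to_real(value, n_bits=16):
    """Decode the integer representation back to a real number."""
    return value / (1 << (n_bits - 1))

def cordic_multiply(a_s, theta_s, n_bits=16, iterations=8):
    """Shift-and-add model of equations (4)-(6) with Bs = 0."""
    one = 1 << (n_bits - 1)          # integer representation of 2^0
    b, theta = 0, theta_s
    for k in range(iterations):
        sigma = 1 if theta >= 0 else -1
        b += sigma * (a_s >> k)      # arithmetic shift; low-order bits are discarded
        theta -= sigma * (one >> k)  # constant 2^-k in the same format
    return b

a = to_fixed(0.75)
theta = to_fixed(0.5)
approx_8 = to_real(cordic_multiply(a, theta, iterations=8))
approx_16 = to_real(cordic_multiply(a, theta, iterations=16))
```

The residual angle left after m iterations bounds the error of the product, so approx_16 is expected to sit closer to 0.375 than approx_8.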
The CORDIC multiplier equations given in (4), (5) and (6) are iterative in nature. A direct
mapping of these equations onto hardware results in a word-serial architecture. Fig. 1
shows the top-level schematic of a word-serial CORDIC multiplier. Such a realization
requires fewer on-chip resources but severely limits the throughput of the multiplier.
The iterative multiplier equations can be easily unfolded into a multi-stage realization,
wherein the individual iterations are represented by separate stages. Such a realization has
two advantages. First, unlike the serial architecture where the shift amount must be updated
after every iteration, the shifters are fixed in the unfolded realization. These shifters can
be implemented in FPGA wiring, thereby reducing the resources utilized. Second,
unfolded architectures can be easily pipelined by placing registers along the feed-forward
paths. This improves the throughput of the multiplier. Fig. 2 shows the unfolded CORDIC
multiplier. The dotted lines represent the pipeline stages.
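The effect of unfolding and pipelining can be illustrated with a small cycle-level model. This is a hedged sketch and not the register-transfer design of this work: the function names are hypothetical, one register is assumed after every stage, and floating-point values stand in for the fixed-point datapath.

```python
def stage(k, a, b, theta):
    """One unfolded CORDIC stage; the shift amount k is fixed for the stage."""
    sigma = 1.0 if theta >= 0 else -1.0
    return a, b + sigma * a * 2.0 ** -k, theta - sigma * 2.0 ** -k

def pipelined_multiplier(operand_pairs, stages=8):
    """Accept one (A, Theta) pair per clock; return (cycle, product) tuples."""
    regs = [None] * stages                   # pipeline register after each stage
    inputs = list(operand_pairs)
    outputs = []
    for cycle in range(len(inputs) + stages - 1):
        # Update from the last stage backwards so each stage reads the
        # register value written in the previous clock cycle.
        for k in range(stages - 1, 0, -1):
            regs[k] = stage(k, *regs[k - 1]) if regs[k - 1] is not None else None
        if cycle < len(inputs):
            a, theta = inputs[cycle]
            regs[0] = stage(0, a, 0.0, theta)
        else:
            regs[0] = None                   # pipeline is draining
        if regs[-1] is not None:
            outputs.append((cycle, regs[-1][1]))
    return outputs

results = pipelined_multiplier([(0.75, 0.5), (-0.5, 0.25), (0.3, -0.6)])
```

In this model the first product appears after as many clock cycles as there are stages, and one further product is delivered on every subsequent cycle, which is the interleaved operation that raises throughput.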
Fig. 1 Word-serial CORDIC multiplier (registers for A, B and Θ; a shifter >> k feeding the B adder/subtractor; a constant 2^-k feeding the Θ adder/subtractor)
Fig. 2 Unfolded/Pipelined architecture of CORDIC Multiplier (inputs As, Bs and Θs; stages with fixed shifts >>1 to >>m, direction signals σ0 to σm-1 and constants 2^0 to 2^-m+1; output Bm)
4. Synthesis, Implementation and Analysis

LUT based FPGAs are prevalent in the electronics market. The capacity and flexibility of LUTs
vary from vendor to vendor and, for a particular vendor, from family to family. We have
considered FPGAs that have 6-input LUTs as their basic logic element. Specifically,
Virtex-5 FPGAs from Xilinx have been considered for implementation. Apart from the
general LUT fabric, these devices also include specialized Carry4 primitives that speed up
the carry propagation encountered in many arithmetic operations. The Carry4 primitive is
basically a fast 4-bit carry chain based on look-ahead carry logic. The primitive allows
FPGAs to efficiently implement arithmetic operations that involve propagation of carry
within a logic cell.
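A behavioural sketch may help to show how a 4-bit carry chain of this kind is typically used as an adder. The model below follows the multiplexer and XOR structure of the Xilinx CARRY4 primitive as we understand it (select inputs S, data inputs DI, carry input CI); the Python function names are hypothetical and the model says nothing about timing.

```python
def carry4(s, di, ci):
    """Behavioural model of a 4-bit carry chain.

    s and di are lists of four bits, least significant bit first.
    Each position outputs s XOR carry-in, and passes the carry on when
    s is 1 or replaces it with di when s is 0.
    """
    outputs, carries = [], []
    carry = ci
    for i in range(4):
        outputs.append(s[i] ^ carry)
        carry = carry if s[i] else di[i]
        carries.append(carry)
    return outputs, carries

def add4(a, b, ci=0):
    """Add two 4-bit numbers: the LUTs supply s = a XOR b, the chain does the rest."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    s = [x ^ y for x, y in zip(a_bits, b_bits)]   # propagate signal from the LUTs
    outputs, carries = carry4(s, a_bits, ci)      # di = a supplies the generate case
    total = sum(bit << i for i, bit in enumerate(outputs))
    return total, carries[3]

sum_bits, carry_out = add4(9, 7)
```

Because the add/subtract units of each CORDIC stage map onto such chains, only the propagate signals need to be produced in LUTs, which is consistent with the lower LUT counts reported for the Carry4 based realizations.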
Detailed implementation has been carried out by realizing serial and unfolded
architectures of CORDIC based multiplier for varying operand word-lengths. The unfolded
CORDIC multiplier is designed in eight stages, first using only conventional LUT fabric
and then using a combination of LUTs and Carry4 primitives. Both combinational and
pipelined realizations have been considered. Performance has been specified in terms of
three parameters viz. resources utilized, timing and power dissipation. Resources include
the number of LUTs, flip-flops and logic slices used. Timing gives the notion of speed. For
combinational realizations, timing analysis mainly focuses on analyzing the paths from
input to output. The result is usually quoted as a single metric that corresponds to the
combinational delay of the critical path. For pipelined realizations, timing analysis is
concerned with the maximum operating frequency of the structure. Both critical path delay
and maximum operating frequency are used to obtain the throughput of a particular
structure. To give a realistic picture of the performance, post Place and Route (PAR) timing
analysis under constrained environments has been done. Post PAR timing analysis also
enables the designers to capture a realistic picture of the switching activity that occurs
along different nodes in a routed design. The same is used to assess the dynamic power
dissipation of an implemented design.
We have compared our multiplier realizations against various traditional approaches
reported in [18-20]. These include the Basic Array Multiplier (BAM), Carry Save
Multiplier (CSM), Carry Ripple Multiplier (CRM), Wallace Tree Multiplier (WTM),
Vedic Multiplier (VM) and three different types of Signed Booth Multiplier (BSM-I, BSM-
II, BSM-III). Additionally, some recent multiplier realizations have also been considered.
These include the multiplexer based array multiplier (MUX-Array) [21], Ring Oscillator
based Booth Multiplier (RO-Booth) [23], Probabilistic Booth Multiplier (Prob.-Booth)
[25], Square Root Carry Select Adder based Wallace Multiplier (SRCSA-Wallace) [26]
and a High-Speed Vedic Multiplier (HS-Vedic) [31]. We have also implemented the
convolution architecture proposed in [33]. The implementation in [33] is based on the

Xilinx IP core multiplier. We have replaced the Xilinx IP core multiplier with CORDIC
based multiplier and analyzed the performance in terms of resources utilized, throughput
and power dissipation. Xilinx Vivado 2016.3 has been used to carry out synthesis,
simulation and implementation of various multiplier and convolution architectures.
4.1. Resource Analysis

Table 1 provides a comparison of the FPGA resources utilized by different multiplier
architectures for an operand word-length of 16 bits. It is observed that CORDIC based
multipliers generally use fewer resources than the other realizations. Owing to
its iterative nature, the CORDIC serial multiplier (CO-Ser.) shows the minimum LUT and
slice count, as the resources are shared among multiple iterations. The unfolded
(CO-Unf.) and pipelined (CO-Pip.) architectures have a higher resource count than the serial
architecture. The resource count can be reduced by using Carry4 primitives in the synthesis
process. The resulting architectures (CO-Unf.-Cry4. and CO-Pip.-Cry4.) have a reduced
LUT and slice count. Further analysis is carried out by comparing the CORDIC based
multipliers against some recent multiplier realizations for varying operand word-lengths.
The results are shown in Fig. 3 and Fig. 4.
Fig. 5 shows the LUT utilization of the convolution architecture proposed in [33] using the
Xilinx IP core multiplier (XIP-Core) and CORDIC based multipliers as the kernel size is
varied. Again, the serial CORDIC based convolution architecture utilizes the minimum LUT
resources. It should be noted that the Xilinx IP core multiplier can be realized using LUTs
or DSP blocks. To ensure a fair comparison we have used the LUT based Xilinx IP core
multiplier in our analysis. Further, the IP core multiplier is implemented with an optimal
level of pipelining. This adds to the number of flip-flops used by the convolution
architecture. As a result, while the serial CORDIC based architecture shows the minimum
flip-flop utilization, the unfolded CORDIC based architectures also have a lower flip-flop
count than the Xilinx IP core based architecture. This is shown in Fig. 6. Fig. 7 shows the
variation in slice count for different realizations as the kernel size is varied. The variation
trend is similar to that of Fig. 5.
Table 1. Resource utilization for different multiplier realizations

| Multiplier Design | No. of LUTs | No. of Flip-flops | No. of Slices |
|---|---|---|---|
| BAM [18-20] | 398 | -- | 124 |
| CSM [18-20] | 282 | -- | 93 |
| CRM [18-20] | 327 | -- | 107 |
| WTM [18-20] | 331 | -- | 113 |
| VM [18-20] | 451 | -- | 142 |
| BSM-I [18-20] | 431 | -- | 142 |
| BSM-II [18-20] | 237 | -- | 77 |
| BSM-III [18-20] | 341 | -- | 108 |
| MUX-Array [21] | 379 | -- | 119 |
| RO-Booth [23] | 71 | 100 | 25 |
| Prob.-Booth [25] | 167 | -- | 63 |
| SRCSA-Wallace [26] | 311 | -- | 101 |
| HS-Vedic [31] | 418 | -- | 133 |
| CO-Ser. [this work] | 35 | 33 | 13 |
| CO-Unf. [this work] | 312 | -- | 113 |
| CO-Pip. [this work] | 270 | 332 | 98 |
| CO-Unf.-Cry4. [this work] | 158 | -- | 58 |
| CO-Pip.-Cry4. [this work] | 201 | 332 | 71 |
Fig. 3 LUT utilization for different multiplier realizations (No. of LUTs versus word-length).
Fig. 4 Logic Slice utilization for different multiplier realizations (No. of Slices versus word-length).
Fig. 5 LUT utilization for different convolution architecture realizations (No. of LUTs versus filter size, 3×3 to 11×11).
Fig. 6 Flip-Flop utilization for different convolution architecture realizations (No. of Flip-Flops versus filter size, 3×3 to 11×11).
Fig. 7 Logic Slice utilization for different convolution architecture realizations (No. of Slices versus filter size, 3×3 to 11×11).
4.2. Timing Analysis

Timing analysis is done to provide a measure of the productivity of the system. This is
usually quoted as the throughput of the system. For purely combinational realizations,
throughput is simply the inverse of the delay associated with the critical path. For
synchronous sequential circuits, throughput is determined by the maximum frequency at
which the circuit can be clocked. Table 2 lists the timing metrics of the different multiplier
architectures for an operand word-length of 16 bits. It is observed that CORDIC based
multipliers have comparatively smaller critical path delays. The critical path delays can be
further reduced by pipelining the unfolded architectures. Pipelining also results in an
interleaved operation, thereby increasing the throughput of the multiplier. The highest
clock frequency is achieved with serial architectures. These include the ring oscillator based
Booth multiplier and the CORDIC serial multiplier. However, owing to their serial nature the
resulting throughput is much lower. It should, however, be noted that the CORDIC serial
multiplier is a word-serial architecture, so its throughput is limited by the number of
iterations rather than by the operand word-length. Thus, any increase in the operand
word-length will have no impact on the overall throughput of the multiplier. This is shown in
Fig. 8, where throughput analysis of the CORDIC based multipliers and some recently proposed
multipliers is done for varying operand word-lengths. The CORDIC serial multiplier exhibits a
flat throughput response. This is advantageous for the large operand word-length multipliers
used in some DSP applications. Table 3 lists the throughput of different multipliers for an
operand word-length of 128 bits. The CORDIC serial multiplier achieves the maximum throughput.
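The throughput relations described above reduce to two short formulas, sketched here in Python. The function names are hypothetical. The division of the serial clock rate by eight iterations is our reading of how the CO-Ser. entry in Table 2 is obtained, based on the eight-iteration serial design described in Section 4.4.

```python
def combinational_throughput_mhz(critical_path_ns):
    """Throughput of a combinational multiplier: inverse of the critical path delay."""
    return 1000.0 / critical_path_ns

def serial_throughput_mhz(max_clock_mhz, cycles_per_product):
    """Throughput of a serial multiplier that needs several clock cycles per product."""
    return max_clock_mhz / cycles_per_product

# Values taken from Table 2 (16-bit operands).
bam = combinational_throughput_mhz(25.34)      # BAM, critical path 25.34 ns
csm = combinational_throughput_mhz(13.29)      # CSM, critical path 13.29 ns
co_ser = serial_throughput_mhz(667.55, 8)      # CO-Ser., eight iterations assumed
```

Since the serial figure depends only on the clock rate and the iteration count, it stays the same when the operand word-length grows, which matches the flat response of the CORDIC serial multiplier.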
Table 2. Timing and Throughput analysis for different multipliers

| Multiplier Design | Critical Path (ns) | Max. Clock Freq. (MHz) | Throughput (MHz) |
|---|---|---|---|
| BAM [18-20] | 25.34 | -- | 39.46 |
| CSM [18-20] | 13.29 | -- | 75.245 |
| CRM [18-20] | 29.83 | -- | 33.524 |
| WTM [18-20] | 19.34 | -- | 51.7 |
| VM [18-20] | 25.76 | -- | 38.82 |
| BSM-I [18-20] | 24.24 | -- | 41.25 |
| BSM-II [18-20] | 20.03 | -- | 49.93 |
| BSM-III [18-20] | 16.01 | -- | 62.46 |
| MUX-Array [21] | 22.78 | -- | 43.9 |
| RO-Booth [23] | 4.224 | 521.98 | 32.63 |
| Prob.-Booth [25] | 15.554 | -- | 64.3 |
| SRCSA-Wallace [26] | 16.416 | -- | 60.91 |
| HS-Vedic [31] | 23.339 | -- | 42.85 |
| CO-Ser. [this work] | 3.171 | 667.55 | 83.444 |
| CO-Unf. [this work] | 13.12 | -- | 76.22 |
| CO-Pip. [this work] | 4.571 | 243.72 | 243.72 |
| CO-Unf.-Cry4. [this work] | 17.76 | -- | 56.31 |
| CO-Pip.-Cry4. [this work] | 3.067 | 331.24 | 331.24 |
Fig. 8 Throughput variation for different multiplier realizations (Frequency in MHz versus word-length).
Table 3. Throughput comparison of different multipliers for an operand word-length of 128 bits

| Multiplier Design | Critical Path (ns) | Throughput (MHz) |
|---|---|---|
| MUX-Array [21] | 192.34 | 5.2 |
| RO-Booth [23] | 32.56 | 30.71 |
| Prob.-Booth [25] | 132.72 | 7.53 |
| SRCSA-Wallace [26] | 102.45 | 9.7 |
| HS-Vedic [31] | 172.63 | 5.8 |
| CO-Ser. [this work] | 3.171 | 83.444 |
| CO-Unf. [this work] | 91.32 | 10.95 |
| CO-Pip. [this work] | 32.33 | 30.93 |
| CO-Unf.-Cry4. [this work] | 117.46 | 8.5 |
| CO-Pip.-Cry4. [this work] | 23.42 | 42.7 |
Fig. 9 shows the throughput variation of the convolution architecture based on the Xilinx IP
core multiplier and different CORDIC based multipliers as the kernel size is varied. It is
observed that the convolution architecture based on the pipelined Carry4 CORDIC multiplier
has the highest throughput. As mentioned previously, the Xilinx IP core multiplier is
implemented with an optimal level of pipelining. Thus any latency concerns that are
prevalent in the pipelined CORDIC multiplier based architecture will also exist in the Xilinx
IP based architecture. However, for a real-time application, throughput is a more important
parameter than latency. Another interesting analysis concerns the critical path of
different architectures. Fig. 10 shows the critical path variation as a function of kernel size.
The Xilinx IP core based architecture has a comparatively longer critical path than the serial
and pipelined CORDIC based architectures. This is one of the drawbacks of using hard IP
cores. These cores are always fixed in the FPGA fabric, thereby increasing the cost of
routing data to and from these cores. While this may not affect the throughput of the system,
it increases the physical capacitance associated with these routes, resulting in greater power
dissipation as discussed in the next section.
Fig. 9 Throughput variations for different convolution architecture realizations (Frequency in MHz versus filter size, 3×3 to 11×11).
Fig. 10 Critical path variation for different convolution architecture realizations (Delay in ns versus filter size, 3×3 to 11×11).
4.3. Power Analysis

For power analysis all the multipliers are implemented using synchronous design practices.
Such an approach involves the use of registers at the input and the output of the multiplier.
Power dissipation is a strong function of the clock frequency. It also has a direct relation
with the physical capacitances along different nodes and routes within a design. To ensure
a fair comparison, power analysis is done for a nominal clock frequency of 100 MHz for
all the multipliers. Table 4 lists the dynamic power dissipation of different multiplier
realizations for an operand word-length of 16 bits. Serial multiplier architectures
(RO-Booth and CO-Ser.) show the least power dissipation as they utilize the minimum
underlying resources, thereby reducing the logic power dissipation. Additionally, these
multipliers also have shorter critical paths and thus lower route capacitances. This further
reduces the dynamic power dissipation. Unfolded CORDIC multipliers also have lower
power dissipation because of the reduced usage of logic resources. Pipelining further
reduces the power dissipation by breaking the critical paths, resulting in reduced
capacitances associated with these routes. Further analysis plots the dynamic power
dissipation as a function of operand word-length. The results are shown in Fig. 11. Similar
trends are observed in convolution architectures based on different multipliers.
Architectures based on serial and pipelined multiplier realizations (XIP-Core, CO-Ser.,
CO-Pip., CO-Pip.-Cry4.) have comparatively lower power dissipation. The CORDIC serial
multiplier based architecture has the least power dissipation as it utilizes fewer underlying
logic resources. The results are shown in Fig. 12.
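The dependence on clock frequency and capacitance noted above is commonly summarized by the first-order model of dynamic power in CMOS logic. The expression below is the standard textbook form and is given only as a reminder; the symbols are not taken from this work, and the power figures reported here come from post-PAR analysis rather than from this formula.

$$P_{dyn} = \alpha \, C_L \, V_{DD}^{2} \, f$$

Here $\alpha$ is the switching activity factor, $C_L$ the switched capacitance of the nodes and routes, $V_{DD}$ the supply voltage and $f$ the clock frequency. With $f$ fixed at 100 MHz for all designs, differences in dynamic power reflect differences in switched capacitance and switching activity.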
Table 4. Dynamic Power dissipation for different multiplier realizations

| Multiplier Design | Dynamic Power Dissipation (mW) |
|---|---|
| BAM [18-20] | 43.75 |
| CSM [18-20] | 29.63 |
| CRM [18-20] | 42.81 |
| WTM [18-20] | 29.34 |
| VM [18-20] | 37.82 |
| BSM-I [18-20] | 43.86 |
| BSM-II [18-20] | 42.95 |
| BSM-III [18-20] | 40.03 |
| MUX-Array [21] | 37.72 |
| RO-Booth [23] | 3.98 |
| Prob.-Booth [25] | 12.64 |
| SRCSA-Wallace [26] | 15.32 |
| HS-Vedic [31] | 19.26 |
| CO-Ser. [this work] | 3.47 |
| CO-Unf. [this work] | 13.24 |
| CO-Pip. [this work] | 9.37 |
| CO-Unf.-Cry4. [this work] | 15.32 |
| CO-Pip.-Cry4. [this work] | 10.42 |
Fig. 11 Dynamic Power dissipation for different multiplier realizations (Power in mW versus word-length).
Fig. 12 Dynamic Power dissipation for different convolution architecture realizations (Power in mW versus filter size, 3×3 to 11×11).
4.4. Error Analysis

The enhanced performance of fixed-point multipliers comes at the cost of reduced
accuracy. A careful balancing of range and precision in fixed-point calculations can reduce
this problem. CORDIC based multipliers incur an additional loss in accuracy because of
their iterative nature. The following analysis expresses this loss in accuracy in terms of
the mean absolute percentage error (MAPE). The absolute percentage error for more than 350
randomly chosen operands was recorded and the mean was calculated. Our analysis for an
8-stage unfolded/pipelined architecture reported a MAPE of 6.032. The same MAPE was
obtained for a serial architecture that calculated the product in eight iterations. An increase
in the number of stages (or number of iterations for the serial architecture) results in an
exponential decrease in MAPE. This is shown in Fig. 13. A trade-off between accuracy and
performance will, therefore, determine the proper choice of CORDIC multiplier for a
particular application. Fig. 14 shows the increase in LUT resources for a reduced MAPE
for an unfolded Carry4 based CORDIC multiplier with an operand word-length of 16 bits.
A similar analysis plots the reduction in throughput for a reduced MAPE for a serial
CORDIC multiplier. The results are shown in Fig. 15.
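The MAPE measurement can be sketched in software as follows. This is an illustrative floating-point experiment, not a reproduction of the reported analysis: the operand distribution, the seed and the exclusion of operands with magnitude below 0.1 (to avoid dividing by near-zero products) are all assumptions, so the numbers it produces are not expected to match the reported value of 6.032.

```python
import random

def cordic_multiply(a, theta, iterations):
    """Linear-mode CORDIC product of a and theta (floating-point model)."""
    b = 0.0
    for k in range(iterations):
        sigma = 1.0 if theta >= 0 else -1.0
        b += sigma * a * 2.0 ** -k
        theta -= sigma * 2.0 ** -k
    return b

def mape(iterations, samples=350, seed=0):
    """Mean absolute percentage error over randomly chosen operand pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        # Assumed distribution: random sign, magnitude in [0.1, 0.99].
        a = rng.choice((-1, 1)) * rng.uniform(0.1, 0.99)
        theta = rng.choice((-1, 1)) * rng.uniform(0.1, 0.99)
        exact = a * theta
        total += abs(cordic_multiply(a, theta, iterations) - exact) / abs(exact)
    return 100.0 * total / samples

errors = {m: mape(m) for m in (8, 12, 16)}
```

Each additional iteration halves the bound on the residual angle, so the entries of errors are expected to shrink roughly geometrically with the iteration count, in line with the exponential trend described above.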
Fig. 13 Variation in MAPE with number of CORDIC stages/iterations
Fig. 14 Variation in LUT utilization with MAPE
Fig. 15 Variation in throughput with MAPE
5. Conclusions and Future scope

This work proposed a novel implementation of fixed-point multiplication based on the
CORDIC algorithm. Different architectures were realized on FPGA platforms. Detailed
analysis revealed that a speed-up in performance is indeed achievable using CORDIC
based multipliers. This speed-up was achieved at the cost of reduced accuracy. A trade-off
between performance and accuracy will determine the proper choice of CORDIC
architecture for a particular application. Our analysis also reveals that the performance
speed-up obtained using CORDIC based multipliers carries over to more complex
operations like convolution. Our future work will focus on using the CORDIC algorithm
to perform the multiply-accumulate operation. The same can be used as a building block to
realize efficient high-performance filter structures.
Acknowledgments
This work was carried out under the seed grant initiative of TEQIP-III project. The authors
are grateful to the TEQIP-III project team of IUST for their assistance and financial support
during the entire course of study.
References
1. G. L. Narayan and B. Venkataramani, Optimization Techniques for FPGA based Wave
Pipelined DSP Blocks, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 13 (2005), 783-792.
2. M. A. Ashour and H. I. Saleh, An FPGA Implementation guide for some different types of
Serial-Parallel Multiplier Structures, Microelectronics Journal, 31 (2000), 161-168.
3. K. Compton, S. Hauck, Reconfigurable Computing: A survey of Systems and Software, ACM
Computing Surveys, 34 (2002), 171-210.
4. IEEE Standard for Binary Floating-Point Arithmetic, IEEE Standards Board, (2018).
5. Technical Report ANSI/IEEE Std. 754-1985, the Institute of Electrical and Electronics
Engineers (1985).
6. C. Inacio, D. Ombres, The DSP decision: Fixed point or floating? IEEE Spectrum, 33 (1996),
72-74.
7. R. Tessier and W. Burleson, Reconfigurable Computing for DSP: A Survey, Journal of VLSI
Signal Processing, 28 (2001), 7-27.
8. S. Hauck and A. DeHon, Reconfigurable Computing: The Theory and Practice of FPGA-based
Computing, (Morgan Kaufmann series, 2008).
9. D. H. Timmerman, B. J. Hosticka, G. Schmidt, A Programmable CORDIC chip for Digital
Signal Processing Applications, IEEE Journal of Solid State Circuits, 26 (1991), 1317-1321.
10. W. H. Chen, C. H. Smith, S. C. Fralick, A fast Computational Algorithm for the Discrete Cosine
Transform, IEEE Transactions on Communications, 25 (1977), 1004-1009.
11. A. M. Despain, Very Fast Fourier Transform Algorithms for Hardware Implementation, IEEE
Transactions on Computers, 28 (1979), 333-341.
12. A. M. Despain, Fourier Transform Computers using CORDIC Iterations, IEEE Transactions on
Computers, 23 (1974), 993-1001.
13. B. Haller, J. Gotze, J. Cavallaro, Efficient Implementation of Rotation Operations for high-
performance QRD-RLS filtering, Proceedings of the International Conference on Application
Specific Systems, Architectures and Processors, (1997).
14. J. R. Cavallaro, F. T. Luk, CORDIC Arithmetic for an SVD Processor, Journal of Parallel and
Distributed Computing, (1988), 271-290.
15. T. J. Todman, G. A. Constantinides, S. J. E. Wilton, O. Mencer, W. Luk and P. Y. K. Cheung,
Reconfigurable Computing: Architecture and Design Methods, IEE Proceedings - Computers and
Digital Techniques, 152 (2005), 193-207.
16. R. Naseer, M. Balakrishnan, and A. Kumar, Direct Mapping of RTL Structures onto LUT-Based
FPGAs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17
(1998), 624-631.
17. O. Kwon, K. Nowka and E. E. Swartzlander Jr., A 16-bit by 16-bit MAC design using fast 5:3
compressor cells, Journal of VLSI Signal Processing, 31 (2002), 77-89.
18. S. Bhattacharjee, S. Sil, B. Basak and A. Chakarbarti, Evaluation of Power Efficient Adder and
Multiplier Circuits for FPGA based DSP Applications, Proceedings of the International
Conference on Communication and Industrial Applications (ICCIA), (2011).
19. B. Khurshid and R. Naaz, Technology Optimized Fixed-Point Bit-Parallel Multiplier for LUT
based FPGAs, International Journal of High Performance Systems Architecture, 6 (2016), 28-
35.
20. K. Kumar, V. Tyagi, H. Kukreja, S. Thakral and M. Verma, A State-of-the-Art Study on
Multipliers: Advancement and Comparison, IIOAB Journal, 09 (2018), 54-66.
21. S. Srikanth, I. T. Banu, G. V. Priya and G. Usha, Low power array multiplier using modified
full adder, IEEE International Conference on Engineering and Technology (ICETECH),
Coimbatore, (2016), 1041-1044.
22. S. K. Sahoo and C. Shekhar, Delay optimized array multiplier for signal and image processing,
International Conference on Image Information Processing, Shimla, (2011), 1-4.
23. D. Okamoto, M. Kondo, T. Yokogawa, Y. Sejima, K. Arimoto and Y. Sato, A Serial Booth
Multiplier Using Ring Oscillator, Fourth International Symposium on Computing and
Networking (CANDAR), Hiroshima, (2016), 458-461.
24. R. Shrestha and U. Rastogi, Design and Implementation of Area-Efficient and Low-Power
Configurable Booth-Multiplier, 29th International Conference on VLSI Design and 2016 15th
International Conference on Embedded Systems (VLSID), Kolkata, (2016), 599-600.
25. M. V. Durga Pavan and S. R. Ramesh, An Efficient Booth Multiplier Using Probabilistic
Approach, International Conference on Communication and Signal Processing (ICCSP),
Chennai, (2018), 365-368.
26. R. B. S. Kesava, B. L. Rao, K. B. Sindhuri and N. U. Kumar, Low Power and Area Efficient
Wallace Tree Multiplier using Carry Select Adder with Binary to Excess-1 Converter,
Conference on Advances in Signal Processing (CASP), Pune, (2016), 248-253.
27. D. Paradhasaradhi, M. Prashanthi and N. Vivek, Modified Wallace Tree Multiplier using
Efficient Square Root Carry Select Adder, International Conference on Green Computing
Communication and Electrical Engineering (ICGCCEE), Coimbatore, (2014), 1-5.
28. S. Asif and Y. Kong, Design of an Algorithmic Wallace Multiplier using High Speed Counters,
Tenth International Conference on Computer Engineering & Systems (ICCES), Cairo, (2015),
133-138.
29. R. Suryavanshi and S. Khare, An Efficient High-Performance Vedic Multiplier: Review,
International Journal of Advanced Engineering and Management, IJOAEM, 2, (2017), 60-64.
30. J. S. Edle and P. R. Deshmukh, Application Specific Architecture of 32-Bit Vedic Multiplier,
International Conference on Computing, Communication, Control and Automation
(ICCUBEA), Pune, (2017), 1-6.
31. D. K. Kahar and H. Mehta, High Speed Vedic Multiplier used Vedic Mathematics, International
Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, (2017), 356-
359.
32. B. N. K. Reddy, Design and Implementation of high performance and area efficient square
architecture using Vedic Mathematics, Analog Integrated Circuits and Signal Processing,
Springer, 102, (2020), 501-506.
33. A. K. Joginipelly and D. Charalampidis, Efficient separable convolution using field
programmable gate arrays, Microprocessors and Microsystems, Elsevier, 71, (2019), 8-15.
34. J. E. Volder, The CORDIC Trigonometric Computing Technique, IRE Trans. Electronic
Computers, 8 (1959), 330-334.
35. D. Ercegovac, T. Lang, Digital Arithmetic, (Morgan Kaufmann, 2004).