
Accepted manuscript to appear in JCSC

Accepted Manuscript
Journal of Circuits, Systems and Computers
by UNIVERSITY OF EXETER LIBRARY on 07/17/20. Re-use and distribution is strictly not permitted, except for Open Access articles.

Article Title: An Efficient Fixed-Point Multiplier based on CORDIC Algorithm

Author(s): Burhan Khurshid, Javeed Jeelani Khan

DOI: 10.1142/S0218126621500808

Received: 26 September 2019


J CIRCUIT SYST COMP Downloaded from www.worldscientific.com

Accepted: 14 July 2020

To be cited as: Burhan Khurshid, Javeed Jeelani Khan, An Efficient Fixed-Point Multiplier
based on CORDIC Algorithm, Journal of Circuits, Systems and
Computers, doi: 10.1142/S0218126621500808

Link to final version: https://doi.org/10.1142/S0218126621500808

This is an unedited version of the accepted manuscript scheduled for publication. It has been uploaded
in advance for the benefit of our customers. The manuscript will be copyedited, typeset and proofread
before it is released in the final form. As a result, the published copy may differ from the unedited
version. Readers should obtain the final version from the above link when it is published. The authors
are responsible for the content of this Accepted Article.
Journal of Circuits, Systems, and Computers
 World Scientific Publishing Company

An Efficient Fixed-Point Multiplier based on CORDIC Algorithm *

Burhan Khurshid†
Department of ECE, IUST,
Awantipora (J&K), India
burhan32.iust@gmail.com

Javeed Jeelani Khan


Department of ECE, IUST,
Awantipora (J&K), India


javeedjeelanikhan@gmail.com


Fixed-point multiplication is an important operation that is frequently used in many digital signal
processing (DSP) applications. The operation is computationally intensive, and very often the
performance of the multiplier determines the overall performance of the DSP system. Consequently, a
wide range of approaches have been proposed for the efficient implementation of fixed-point
multipliers on different hardware platforms. In this paper, we use the COordinate Rotation DIgital
Computer (CORDIC) algorithm to perform fixed-point multiplication. The motivation for our approach
is that CORDIC is a hardware-efficient algorithm in which accuracy can be traded off for
performance. Our implementation targets field programmable gate arrays (FPGAs) and focuses on
exploiting the underlying general and specialized fabric to the fullest. Performance comparisons
against various traditional and recent approaches show that a substantial improvement is achievable
by using CORDIC based multipliers. We have also implemented a recently proposed convolution
architecture using CORDIC based multipliers. The results show that a proper choice of CORDIC
architecture can improve performance parameters such as resource utilization, throughput and
dynamic power. This improvement, however, comes at a small cost in accuracy. Our analysis of an
8-stage CORDIC multiplier reports a mean absolute percentage error (MAPE) of 6.032, a figure that
reduces exponentially with an increasing number of stages.
Keywords: CORDIC; FPGA; Carry4 primitives; Fixed-point Arithmetic; LUT.

1. Introduction
Fixed-point multipliers are the elements of choice in the design and implementation of
high-performance DSP hardware [1-3]. Fixed-point representation may be considered a degenerate
case of floating-point representation, in the sense that the exponent is fixed and does not vary
with time. Such a representation is well suited to FPGAs as it reduces the number of logic levels,
thereby limiting the end-to-end critical path delays. Performance versus accuracy trade-offs often
compel designers to use fixed-point multipliers in their realizations [4-5]. Their performance is
therefore crucial and affects the overall performance of the top-level design they are part of
[6-7].

† Corresponding author.

Efficient implementation of complex arithmetic operations has always been a challenge for
designers. Traditionally, two approaches have been used: the look-up table (LUT) method and
polynomial expansion. While the former demands large LUTs for higher precision, the latter suffers
from slow convergence. COordinate Rotation DIgital Computer (CORDIC) represents a compromise
between these two methods, where the desired precision is achieved using relatively fewer LUTs.
This flexibility has enabled CORDIC to encapsulate a diversity of arithmetic functions using a
single basic set of recursive equations [8-9]. Applications include computation of the Discrete
Cosine Transform (DCT) [10], Fast Fourier Transform (FFT) [11-12], Recursive Least Squares (RLS)
filtering [13] and Singular Value Decomposition (SVD) [14].
FPGAs are often used as implementation platforms for high-speed tasks that cannot be achieved
using conventional processors. With distinctive advantages such as lower Non-Recurring Engineering
(NRE) costs, a reconfigurable design approach, high integration levels and post-production design
verification [15], FPGAs are fast moving from prototyping to low and medium volume production
[16-17]. The architectural organization of FPGAs enables realizations that are distributed
spatially. This makes it possible to capture the large amount of parallelism available in many
hardware-efficient algorithms like CORDIC. FPGAs thus serve as the platforms of choice for
implementing such algorithms. Another important feature of modern FPGAs is the introduction of
specialized primitives and macroblocks. Including these special blocks in the design process
considerably improves the performance of the top-level design.
A large number of traditional approaches to multiplication have been extended to the
implementation of fixed-point multipliers. Prominent among these are Array multipliers, Carry
ripple multipliers, Carry save multipliers, Booth multipliers, Wallace tree multipliers and Vedic
multipliers. A detailed discussion of these multipliers can be found in [18-20]. Owing to the
increased demand for performance, these traditional approaches have been modified extensively. In
[21] the authors propose a low power array multiplier using multiplexer based full-adder cells. A
related approach is reported in [22], wherein the full-adder cell is modified to achieve reduced
delays. Booth and Wallace multipliers have also been modified for different performance
parameters. In [23] the authors present a new realization of the Booth multiplier using a ring
oscillator. The proposed architecture is bit-serial in nature with low power dissipation. A
reduction in power dissipation, accompanied by a reduction in resource utilization through clock
gating and resource sharing in the Booth multiplier, is reported in [24]. While the approaches in
[23] and [24] are exact, a probabilistic approach to Booth multiplication is reported in [25]. The
authors replace the conventional truncation circuitry with a probabilistic estimation bias circuit
and claim better resource utilization and reduced power dissipation. Similarly, low-power and
area-efficient realizations of Wallace tree multipliers using a carry-select adder and a binary to
excess-1 converter are reported in [26] and [27]. High speed counter based Wallace multipliers
have also been reported in [28]. Another approach that has gained prominence in recent times is
based on Vedic arithmetic [29]. Vedic multipliers offer a high degree of parallelism that can be
exploited to implement fast parallel multipliers. Some noteworthy contributions are reported in
[30-32].


While all the above-mentioned approaches have a high degree of accuracy, they sometimes fail to
achieve the performance demanded by some DSP applications. CORDIC based computations can result in
multiplier implementations wherein accuracy can be traded off for performance. The CORDIC
algorithm is frequently used for computing complex, trigonometric and hyperbolic functions;
however, it has very rarely been used for linear operations like multiplication. In this paper, we
show experimentally that multipliers based on CORDIC can outperform the above-mentioned approaches
in terms of different performance parameters. We have also used CORDIC based multipliers for
computation of 2D convolution and compared our results with some recent work [33]. Additionally,
an error analysis gives a clear understanding of the performance versus accuracy trade-offs
associated with CORDIC multipliers.
The rest of the paper is organized as follows: Section 2 briefly discusses the CORDIC algorithm.
Section 3 discusses the CORDIC based multiplication process and the resulting architectures.
Synthesis, implementation and analysis are carried out in Section 4. Conclusions are drawn in
Section 5, which also discusses the future scope of the work. References are listed at the end.

2. CORDIC Algorithm
Since its introduction in 1959 by Volder [34], the basic CORDIC algorithm has been expanded and
modified to encapsulate a wide range of arithmetic functions in a single basic set of equations.
Depending on the function to be evaluated, the algorithm can be defined under linear, circular and
hyperbolic coordinates [35]. There are two basic modes of operation: the rotation mode and the
vectoring mode. The rotation mode operates by first specifying a desired rotation angle. Each
iteration aims to diminish the magnitude of the residual angle. Ideally, the angle value should
reduce to zero. However, a trade-off between the number of iterations and the accuracy defines the
final magnitude of the residual angle. The vectoring mode aims at aligning the resultant vector
along the horizontal axis. This is achieved by rotating the input vector through whatever angle is
necessary to align it along the horizontal axis. Again, a trade-off between the number of
iterations and accuracy defines the angular displacement between the final vector and the
horizontal axis.


Both operating modes can be captured using a single set of unified equations:

$$A_{k+1} = A_k - n \, \sigma_k \, B_k \, 2^{-k} \tag{1}$$

$$B_{k+1} = B_k + \sigma_k \, A_k \, 2^{-k} \tag{2}$$

$$\Theta_{k+1} = \begin{cases} \Theta_k - \sigma_k \tan^{-1}(2^{-k}) & \text{if } n = 1 \\ \Theta_k - \sigma_k \tanh^{-1}(2^{-k}) & \text{if } n = -1 \\ \Theta_k - \sigma_k \, 2^{-k} & \text{if } n = 0 \end{cases} \tag{3}$$

where

$$n = \begin{cases} +1 & \text{for circular coordinates} \\ 0 & \text{for linear coordinates} \\ -1 & \text{for hyperbolic coordinates} \end{cases}$$

A and B are coordinate vectors, Θ is the angle vector, and σ determines the direction of rotation
in the next iteration. By operating the unified equations in different coordinate systems,
different arithmetic functions can be evaluated.
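As an illustration, equations (1) to (3) can be sketched in a few lines of Python for the rotation mode. This is a minimal floating-point sketch, not the hardware described in this paper. The function names, the rule used to choose σ (the sign of the residual angle) and the starting index k = 0 are assumptions made for the example.

```python
import math

def elementary_angle(k, n):
    """Per-iteration angle constant from equation (3)."""
    if n == 1:
        return math.atan(2.0 ** -k)      # circular coordinates
    if n == -1:
        return math.atanh(2.0 ** -k)     # hyperbolic coordinates; needs k >= 1
    return 2.0 ** -k                     # linear coordinates

def cordic_rotate(a, b, theta, n, iterations, k_start=0):
    """Iterate equations (1)-(3) in rotation mode.

    Assumption: sigma_k is +1 when the residual angle is non-negative and -1
    otherwise, which drives the residual angle toward zero.
    """
    for k in range(k_start, k_start + iterations):
        sigma = 1.0 if theta >= 0.0 else -1.0
        a_next = a - n * sigma * b * 2.0 ** -k   # equation (1)
        b_next = b + sigma * a * 2.0 ** -k       # equation (2)
        theta -= sigma * elementary_angle(k, n)  # equation (3)
        a, b = a_next, b_next
    return a, b, theta

# Linear coordinates (n = 0): b accumulates approximately a * theta.
a_lin, b_lin, residual = cordic_rotate(0.75, 0.0, 0.3, n=0, iterations=16)

# Circular coordinates (n = 1): pre-scaling by the CORDIC gain gives (cos, sin).
gain = math.prod(math.sqrt(1.0 + 4.0 ** -k) for k in range(24))
cos_val, sin_val, _ = cordic_rotate(1.0 / gain, 0.0, 0.5, n=1, iterations=24)
```

The same loop serves all three coordinate systems; only the value of n and the angle constant change, which is the sense in which the equations are unified.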


3. CORDIC based Multiplier
Section 2 discussed the unified CORDIC equations for the linear, circular and hyperbolic
coordinate systems. For linear coordinate system, n = 0 and the basic set of iterative
equations become:
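$$A_{k+1} = A_k \tag{4}$$

$$B_{k+1} = B_k + \sigma_k \, A_k \, 2^{-k} \tag{5}$$

$$\Theta_{k+1} = \Theta_k - \sigma_k \, 2^{-k} \tag{6}$$

After m iterations the resultant equations would be:

$$A_m = A_s \tag{7}$$

$$B_m = B_s + A_s \, \Theta_s \tag{8}$$

$$\Theta_m = 0 \tag{9}$$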
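The step from the iteration in equations (4) to (6) to the closed form in equations (7) to (9) can be made explicit. The derivation below is a sketch under two assumptions that are not stated in the text: the iterations run over k = 0, ..., m-1, and σ_k is chosen as the sign of the residual angle Θ_k.

```latex
\begin{align*}
B_m &= B_s + A_s \sum_{k=0}^{m-1} \sigma_k \, 2^{-k}, &
\Theta_m &= \Theta_s - \sum_{k=0}^{m-1} \sigma_k \, 2^{-k} \\
\Rightarrow \; B_m &= B_s + A_s \, (\Theta_s - \Theta_m), &
|\Theta_m| &\le 2^{-(m-1)} \quad \text{for } |\Theta_s| \le 2
\end{align*}
```

Under these assumptions B_m differs from B_s + A_s Θ_s by at most |A_s| 2^{-(m-1)}, so equations (8) and (9) hold exactly only in the limit of many iterations and approximately for a finite m.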
If the B vector has an initial value Bs = 0, then after m iterations the B vector gives the
product of the As and Θs vectors, i.e. for Bs = 0:

$$B_m = A_s \, \Theta_s \tag{10}$$

Equation 10 can be used to multiply the operands As and Θs. Since CORDIC operates using simple
shift and add operations, mapping equation 10 onto hardware can provide an efficient multiplier
block. Note that the multiplication is fixed-point. Therefore, N-bit operands As and Θs may be
represented as:

$$A_s = a_{N-1} \, . \, a_{N-2} \, a_{N-3} \ldots a_1 \, a_0 \tag{11}$$

$$\Theta_s = \theta_{N-1} \, . \, \theta_{N-2} \, \theta_{N-3} \ldots \theta_1 \, \theta_0 \tag{12}$$

The values of these numbers lie in the range [-1, 1) and are given by:

$$A_s = -a_{N-1} + \sum_{i=1}^{N-1} a_{N-1-i} \, 2^{-i} \tag{13}$$

$$\Theta_s = -\theta_{N-1} + \sum_{i=1}^{N-1} \theta_{N-1-i} \, 2^{-i} \tag{14}$$

The product Bm is a truncated product, i.e. the N-1 lower order bits are discarded. Therefore,

$$B_m = b_{N-1} \, . \, b_{N-2} \, b_{N-3} \ldots b_1 \, b_0 \tag{15}$$
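The fixed-point format of equations (11) to (15) and the shift-and-add product of equation (10) can be modelled with plain integers. The following Python sketch is illustrative only. The helper names, the word length of 8 bits, the stage count of 8, the starting index k = 0 and the sign rule for σ are all assumptions, and the model ignores the guard bit that hardware would need for the intermediate value of B.

```python
def to_fixed(x, n_bits):
    """Quantize x in [-1, 1) to an N-bit two's-complement integer
    (1 sign bit, N-1 fraction bits), saturating at the range limits."""
    scale = 1 << (n_bits - 1)
    q = int(round(x * scale))
    return max(-scale, min(scale - 1, q))

def to_float(q, n_bits):
    """Interpret an integer in the same format as a real value."""
    return q / (1 << (n_bits - 1))

def cordic_multiply(a_fix, theta_fix, n_bits, stages):
    """Linear-coordinate CORDIC product using only shifts and adds."""
    b = 0                              # Bs = 0, as required by equation (10)
    theta = theta_fix
    one = 1 << (n_bits - 1)            # fixed-point value of 2^0
    for k in range(stages):
        step = one >> k                # 2^-k in the same format
        if theta >= 0:                 # sigma_k = +1
            b += a_fix >> k            # arithmetic right shift truncates
            theta -= step
        else:                          # sigma_k = -1
            b -= a_fix >> k
            theta += step
    return b

a_fix = to_fixed(0.75, 8)              # bit pattern 0.1100000
theta_fix = to_fixed(-0.5, 8)          # bit pattern 1.1000000
product = cordic_multiply(a_fix, theta_fix, 8, 8)
print(to_float(product, 8), 0.75 * -0.5)
```

The result stays in the same N-bit format as the operands, which corresponds to the truncated product of equation (15). The printed pair shows the approximate and exact products side by side; the gap between them comes from the finite number of stages and from the truncation in each shift.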

The CORDIC multiplier equations 4, 5 and 6 are iterative in nature. A direct mapping of these
equations onto hardware results in a word-serial architecture. Fig. 1 shows the top-level
schematic of a word-serial CORDIC multiplier. Such a realization requires fewer on-chip resources
but severely limits the throughput of the multiplier. The iterative multiplier equations can be
easily unfolded into a multi-stage realization, wherein the individual iterations are represented
by separate stages. Such a realization has two advantages. First, unlike the serial architecture,
where the shift amount must be updated after every iteration, the shifters in the unfolded
realization are fixed. These shifters can be implemented entirely in FPGA wiring, thereby reducing
the resources utilized. Second, unfolded architectures can be easily pipelined by placing
registers along the feed-forward paths. This improves the throughput of the multiplier. Fig. 2
shows the unfolded CORDIC multiplier. The dotted lines represent the pipeline stages.
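The effect of placing registers between unfolded stages can be illustrated with a small cycle-level model. The Python sketch below is a behavioural assumption, not the authors' RTL. The names `stage` and `run_pipeline`, the starting index k = 0 and the sign rule for σ are hypothetical choices made for the example.

```python
def stage(k, a, b, theta, one):
    """Combinational logic of stage k: a fixed shift by k and one add/subtract pair."""
    if theta >= 0:
        return a, b + (a >> k), theta - (one >> k)
    return a, b - (a >> k), theta + (one >> k)

def run_pipeline(operands, n_bits, stages):
    """Clock a stream of (a, theta) pairs through register-separated stages."""
    one = 1 << (n_bits - 1)
    regs = [None] * stages                       # regs[k] holds the output of stage k
    results = []
    stream = list(operands) + [None] * stages    # extra cycles flush the pipeline
    for item in stream:
        if regs[-1] is not None:
            results.append(regs[-1][1])          # a product leaves the last register
        # Update from the last stage backwards so that every register
        # captures the value its predecessor held in the previous cycle.
        for k in range(stages - 1, 0, -1):
            regs[k] = None if regs[k - 1] is None else stage(k, *regs[k - 1], one)
        regs[0] = None if item is None else stage(0, item[0], 0, item[1], one)
    return results

# Two operand pairs in an 8-bit, 8-stage pipeline (values scaled by 2^7).
products = run_pipeline([(96, -64), (64, 64)], n_bits=8, stages=8)
```

In this model a new operand pair enters on every clock cycle, and after an initial latency equal to the number of stages one product leaves per cycle, which is why pipelining raises throughput without shortening the latency of an individual product.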

[Figure: registers (D) holding A, B and Θ; a shifter (>> k) feeding the add/subtract unit for B; the constant 2^-k feeding the add/subtract unit for Θ]
Fig. 1 Word-serial CORDIC multiplier

[Figure: cascaded stages with fixed shifters (>>1 ... >>m) on As, add/subtract units controlled by σ0 ... σm-1 taking Bs to Bm, and constants 2^0 ... 2^-(m-1) applied to Θs]
Fig. 2 Unfolded/Pipelined architecture of CORDIC Multiplier


4. Synthesis, Implementation and Analysis


LUT based FPGAs are prevalent in the electronics market. The capacity and flexibility of LUTs vary
from vendor to vendor and, for a particular vendor, from family to family. We have considered
FPGAs that have 6-input LUTs as their basic logic element. Specifically, Virtex-5 FPGAs from
Xilinx have been considered for implementation. Apart from the general LUT fabric, these devices
also include specialized Carry4 primitives that speed up the carry propagation encountered in many
arithmetic operations. The Carry4 primitive is basically a fast 4-bit carry chain based on
look-ahead carry logic. The primitive allows FPGAs to efficiently implement arithmetic operations
that involve propagation of carry within a logic cell.
Detailed implementation has been carried out by realizing serial and unfolded architectures of the
CORDIC based multiplier for varying operand word-lengths. The unfolded CORDIC multiplier is
designed in eight stages, first using only conventional LUT fabric and then using a combination of
LUTs and Carry4 primitives. Both combinational and pipelined realizations have been considered.
Performance is specified in terms of three parameters, viz. resources utilized, timing and power
dissipation. Resources include the number of LUTs, flip-flops and logic slices used. Timing gives
a notion of speed. For combinational realizations, timing analysis mainly focuses on the paths
from input to output. The result is usually quoted as a single metric that corresponds to the
combinational delay of the critical path. For pipelined realizations, timing analysis is concerned
with the maximum operating frequency of the structure. Critical path delay and maximum operating
frequency are used to obtain the throughput of a particular structure. To give a realistic picture
of the performance, post Place and Route (PAR) timing analysis under constrained environments has
been done. Post PAR timing analysis also enables designers to capture a realistic picture of the
switching activity that occurs at different nodes in a routed design. This switching activity is
used to assess the dynamic power dissipation of an implemented design.
We have compared our multiplier realizations against various traditional approaches reported in
[18-20]. These include the Basic Array Multiplier (BAM), Carry Save Multiplier (CSM), Carry Ripple
Multiplier (CRM), Wallace Tree Multiplier (WTM), Vedic Multiplier (VM) and three different types
of Signed Booth Multiplier (BSM-I, BSM-II, BSM-III). Additionally, some recent multiplier
realizations have been considered. These include the multiplexer based array multiplier
(MUX-Array) [21], the Ring Oscillator based Booth Multiplier (RO-Booth) [23], the Probabilistic
Booth Multiplier (Prob.-Booth) [25], the Square Root Carry Select Adder based Wallace Multiplier
(SRCSA-Wallace) [26] and a High-Speed Vedic Multiplier (HS-Vedic) [31]. We have also implemented
the convolution architecture proposed in [33]. The implementation in [33] is based on the Xilinx
IP core multiplier. We have replaced the Xilinx IP core multiplier with the CORDIC based
multiplier and analyzed the performance in terms of resources utilized, throughput and power
dissipation. Xilinx Vivado 2016.3 has been used to carry out synthesis, simulation and
implementation of the various multiplier and convolution architectures.


4.1. Resource Analysis


Table 1 compares the FPGA resources utilized by different multiplier architectures for an operand
word-length of 16 bits. CORDIC based multipliers use fewer resources than most of the other
realizations. Owing to its iterative nature, the CORDIC serial multiplier (CO-Ser.) shows the
minimum LUT and slice count, as the resources are shared among multiple iterations. The unfolded
(CO-Unf.) and pipelined (CO-Pip.) architectures have a higher resource count than the serial
architecture. The resource count can be reduced by using Carry4 primitives in the synthesis
process. The resulting architectures (CO-Unf.-Cry4. and CO-Pip.-Cry4.) have a reduced LUT and
slice count. Further analysis is carried out by comparing the CORDIC based multipliers against
some recent multiplier realizations for varying operand word-lengths. The results are shown in
Fig. 3 and Fig. 4.

Fig. 5 shows the LUT utilization of the convolution architecture proposed in [33] using the Xilinx
IP core multiplier (XIP-Core) and CORDIC based multipliers as the kernel size is varied. Again,
the serial CORDIC based convolution architecture utilizes the fewest LUT resources. It should be
noted that the Xilinx IP core multiplier can be realized using LUTs or DSP blocks. To ensure a
fair comparison, we have used the LUT based Xilinx IP core multiplier in our analysis. Further,
the IP core multiplier is implemented with an optimal level of pipelining. This adds to the number
of flip-flops used by the convolution architecture. As Fig. 6 shows, the serial CORDIC based
architecture has the lowest flip-flop utilization, and the unfolded CORDIC based architectures
also have a lower flip-flop count than the Xilinx IP core based architecture. Fig. 7 shows the
variation in slice count for different realizations as the kernel size is varied. The trend is
similar to that of Fig. 5.

Table 1. Resource utilization for different multiplier realizations

Multiplier Design            No. of LUTs   No. of Flip-flops   No. of Slices
BAM [18-20]                  398           --                  124
CSM [18-20]                  282           --                  93
CRM [18-20]                  327           --                  107
WTM [18-20]                  331           --                  113
VM [18-20]                   451           --                  142
BSM-I [18-20]                431           --                  142
BSM-II [18-20]               237           --                  77
BSM-III [18-20]              341           --                  108
MUX-Array [21]               379           --                  119
RO-Booth [23]                71            100                 25
Prob.-Booth [25]             167           --                  63
SRCSA-Wallace [26]           311           --                  101
HS-Vedic [31]                418           --                  133
CO-Ser. [this work]          35            33                  13
CO-Unf. [this work]          312           --                  113
CO-Pip. [this work]          270           332                 98
CO-Unf.-Cry4. [this work]    158           --                  58
CO-Pip.-Cry4. [this work]    201           332                 71

[Figure: No. of LUTs versus operand word-length for MUX-Array, RO-Booth, Prob.-Booth, SRCSA-Wallace, HS-Vedic, CO-Ser., CO-Unf., CO-Pip., CO-Unf.-Cry4. and CO-Pip.-Cry4.]
Fig. 3 LUT utilization for different multiplier realizations.

[Figure: No. of Slices versus operand word-length for the same multiplier realizations]
Fig. 4 Logic Slice utilization for different multiplier realizations.


[Figure: No. of LUTs versus filter size (3×3 to 11×11) for XIP-Core, CO-Ser., CO-Unf., CO-Pip., CO-Unf.-Cry4. and CO-Pip.-Cry4.]
Fig. 5 LUT utilization for different convolution architecture realizations.

[Figure: No. of Flip-Flops versus filter size (3×3 to 11×11) for the same realizations]
Fig. 6 Flip-Flop utilization for different convolution architecture realizations.

[Figure: No. of Slices versus filter size (3×3 to 11×11) for the same realizations]
Fig. 7 Logic Slice utilization for different convolution architecture realizations.

4.2. Timing Analysis


Timing analysis provides a measure of the productivity of the system, usually quoted as its
throughput. For purely combinational realizations, throughput is simply the inverse of the
critical path delay. For synchronous sequential circuits, throughput is determined by the maximum
frequency at which the circuit can be clocked. Table 2 lists the timing metrics of the different
multiplier architectures for an operand word-length of 16 bits. It is observed that CORDIC based
multipliers have comparatively smaller critical path delays. The critical path delays can be
further reduced by pipelining the unfolded architectures. Pipelining also results in an
interleaved operation, thereby increasing the throughput of the multiplier. The highest clock
frequencies are achieved by the serial architectures, namely the ring oscillator based Booth
multiplier and the CORDIC serial multiplier. However, owing to their serial nature, the resulting
throughput is much lower than the clock frequency. It should be noted that the CORDIC serial
multiplier is a word-serial architecture, so its throughput is limited by the number of iterations
rather than by the operand word-length. Thus, an increase in the operand word-length has no impact
on the overall throughput of the multiplier. This is shown in Fig. 8, where the throughput of the
CORDIC based multipliers and some recently proposed multipliers is analysed for varying operand
word-lengths. The CORDIC serial multiplier exhibits a flat throughput response. This is
advantageous for the large operand word-length multipliers used in some DSP applications. Table 3
lists the throughput of different multipliers for an operand word-length of 128 bits. The CORDIC
serial multiplier achieves the maximum throughput.
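The relationship between critical path, clock frequency and throughput described above can be checked with a short calculation. The Python sketch below uses hypothetical function names and takes its input values from Table 2. The figure of eight clock cycles per product for the serial CORDIC multiplier follows the eight-iteration design used in this work.

```python
def combinational_throughput_mhz(critical_path_ns):
    """Purely combinational design: one result per critical-path delay."""
    return 1e3 / critical_path_ns              # 1/ns is GHz, so scale to MHz

def sequential_throughput_mhz(f_max_mhz, cycles_per_result=1):
    """Clocked design: maximum clock frequency divided by cycles per result.

    cycles_per_result = 1 models a fully pipelined design.
    """
    return f_max_mhz / cycles_per_result

bam = combinational_throughput_mhz(25.34)      # BAM critical path, Table 2
co_pip = sequential_throughput_mhz(243.72)     # CO-Pip.: one product per clock
co_ser = sequential_throughput_mhz(667.55, 8)  # CO-Ser.: eight iterations per product
```

Because the serial multiplier's throughput depends only on its clock frequency and iteration count in this model, it does not change with operand word-length, which is consistent with the flat response reported for CO-Ser.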

Table 2. Timing and Throughput analysis for different multipliers

Multiplier Design            Critical Path (ns)   Max. Clock Freq. (MHz)   Throughput (MHz)
BAM [18-20]                  25.34                --                       39.46
CSM [18-20]                  13.29                --                       75.245
CRM [18-20]                  29.83                --                       33.524
WTM [18-20]                  19.34                --                       51.7
VM [18-20]                   25.76                --                       38.82
BSM-I [18-20]                24.24                --                       41.25
BSM-II [18-20]               20.03                --                       49.93
BSM-III [18-20]              16.01                --                       62.46
MUX-Array [21]               22.78                --                       43.9
RO-Booth [23]                4.224                521.98                   32.63
Prob.-Booth [25]             15.554               --                       64.3
SRCSA-Wallace [26]           16.416               --                       60.91
HS-Vedic [31]                23.339               --                       42.85
CO-Ser. [this work]          3.171                667.55                   83.444
CO-Unf. [this work]          13.12                --                       76.22
CO-Pip. [this work]          4.571                243.72                   243.72
CO-Unf.-Cry4. [this work]    17.76                --                       56.31
CO-Pip.-Cry4. [this work]    3.067                331.24                   331.24

[Figure: throughput (frequency in MHz) versus operand word-length for MUX-Array, RO-Booth, Prob.-Booth, SRCSA-Wallace, HS-Vedic, CO-Ser., CO-Unf., CO-Pip., CO-Unf.-Cry4. and CO-Pip.-Cry4.]
Fig. 8 Throughput variation for different multiplier realizations


Table 3. Throughput comparison of different multipliers for an operand word-length of 128 bits

Multiplier Design            Critical Path (ns)   Throughput (MHz)
MUX-Array [21]               192.34               5.2
RO-Booth [23]                32.56                30.71
Prob.-Booth [25]             132.72               7.53
SRCSA-Wallace [26]           102.45               9.7
HS-Vedic [31]                172.63               5.8
CO-Ser. [this work]          3.171                83.444
CO-Unf. [this work]          91.32                10.95
CO-Pip. [this work]          32.33                30.93
CO-Unf.-Cry4. [this work]    117.46               8.5
CO-Pip.-Cry4. [this work]    23.42                42.7

Fig. 9 shows the throughput variation of the convolution architecture based on the Xilinx IP core
multiplier and on different CORDIC based multipliers as the kernel size is varied. The convolution
architecture based on the pipelined Carry4 CORDIC multiplier has the highest throughput. As
mentioned previously, the Xilinx IP core multiplier is implemented with an optimal level of
pipelining. Thus any latency concerns that apply to the pipelined CORDIC multiplier based
architecture also apply to the Xilinx IP based architecture. However, for a real-time application,
throughput is a more important parameter than latency. Another interesting analysis concerns the
critical path of the different architectures. Fig. 10 shows the critical path variation as a
function of kernel size. The Xilinx IP core based architecture has a comparatively longer critical
path than the serial and pipelined CORDIC based architectures. This is one of the drawbacks of
using hard IP cores. These cores are fixed in the FPGA fabric, which increases the cost of routing
data to and from them. While this may not affect the throughput of the system, it increases the
physical capacitance associated with these routes, resulting in greater power dissipation, as
discussed in the next section.

[Figure: throughput (frequency in MHz) versus filter size (3×3 to 11×11) for XIP-Core, CO-Ser., CO-Unf., CO-Pip., CO-Unf.-Cry4. and CO-Pip.-Cry4.]
Fig. 9 Throughput variations for different convolution architecture realizations

[Figure: critical path delay (ns) versus filter size (3×3 to 11×11) for the same realizations]
Fig. 10 Critical path variation for different convolution architecture realizations

4.3. Power Analysis



For power analysis, all the multipliers are implemented using synchronous design practices. Such
an approach involves the use of registers at the input and the output of the multiplier. Power
dissipation is a strong function of the clock frequency. It is also directly related to the
physical capacitances along different nodes and routes within a design. To ensure a fair
comparison, power analysis is done at a nominal clock frequency of 100 MHz for all the
multipliers. Table 4 lists the dynamic power dissipation of different multiplier realizations for
an operand word-length of 16 bits. Serial multiplier architectures (RO-Booth and CO-Ser.) show the
least power dissipation as they utilize the fewest underlying resources, thereby reducing the
logic power dissipation. Additionally, these multipliers have shorter critical paths and thus
smaller route capacitances. This further reduces the dynamic power dissipation. Unfolded CORDIC
multipliers also have lower power dissipation because of the reduced usage of logic resources.
Pipelining further reduces the power dissipation by breaking the critical paths, resulting in
reduced capacitances associated with these routes. Further analysis plots the dynamic power
dissipation as a function of operand word-length. The results are shown in Fig. 11. Similar trends
are observed in convolution architectures based on different multipliers. Architectures based on
serial and pipelined multiplier realizations (XIP-Core, CO-Ser., CO-Pip., CO-Pip.-Cry4.) have
comparatively lower power dissipation. The CORDIC serial multiplier based architecture has the
least power dissipation as it utilizes the fewest underlying logic resources. The results are
shown in Fig. 12.
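The dependence on clock frequency and capacitance described above matches the standard first-order model of CMOS switching power. The expression below is that textbook model, not a formula taken from this paper. The symbols α_i (switching activity of node i), C_i (capacitance of node or route i) and V_DD (supply voltage) are introduced here for illustration only.

```latex
P_{\mathrm{dyn}} \approx \sum_{i} \alpha_i \, C_i \, V_{DD}^{2} \, f_{\mathrm{clk}}
```

Under this model, holding the clock frequency at the same value for every design leaves switching activity and capacitance as the quantities that differ between designs.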

Table 4. Dynamic Power dissipation for different multiplier realizations

Multiplier Design            Dynamic Power Dissipation (mW)
BAM [18-20]                  43.75
CSM [18-20]                  29.63
CRM [18-20]                  42.81
WTM [18-20]                  29.34
VM [18-20]                   37.82
BSM-I [18-20]                43.86
BSM-II [18-20]               42.95
BSM-III [18-20]              40.03
MUX-Array [21]               37.72
RO-Booth [23]                3.98
Prob.-Booth [25]             12.64
SRCSA-Wallace [26]           15.32
HS-Vedic [31]                19.26
CO-Ser. [this work]          3.47
CO-Unf. [this work]          13.24
CO-Pip. [this work]          9.37
CO-Unf.-Cry4. [this work]    15.32
CO-Pip.-Cry4. [this work]    10.42


[Plot: Dynamic Power Dissipation; x-axis: Word-length (0 to 68); y-axis: Power (mW) (0 to 150); series: MUX-Array, RO-Booth, Prob.-Booth, SRCSA-Wallace, HS-Vedic, CO-Ser., CO-Unf., CO-Pip., CO-Unf.-Cry4., CO-Pip.-Cry4.]

Fig. 11 Dynamic Power dissipation for different multiplier realizations

[Plot: Dynamic Power Dissipation; x-axis: Filter-Size (3×3, 5×5, 7×7, 9×9, 11×11); y-axis: Power (mW) (0 to 70); series: XIP-Core, CO-Ser., CO-Unf., CO-Pip., CO-Unf.-Cry4., CO-Pip.-Cry4.]

Fig. 12 Dynamic Power dissipation for different convolution architecture realizations

4.4. Error Analysis


The enhanced performance of fixed-point multipliers comes at the cost of reduced
accuracy. A careful balancing of range and precision in fixed-point calculations can reduce
this problem. CORDIC-based multipliers incur an additional loss in accuracy because of
their iterative nature. The following analysis expresses this loss in accuracy in terms of the
mean absolute percent error (MAPE). The absolute percentage error was recorded for more than
350 randomly chosen operands, and the mean was calculated. For an 8-stage unfolded/pipelined
architecture, our analysis reported a MAPE of 6.032. The same MAPE was
obtained for a serial architecture that calculates the product in eight iterations. Increasing
the number of stages (or the number of iterations for the serial architecture) results in an
exponential decrease in MAPE, as shown in Fig. 13. A trade-off between accuracy and
performance will, therefore, determine the proper choice of CORDIC multiplier for a
particular application. Fig. 14 shows the increase in LUT resources required for a reduced MAPE
for an unfolded Carry4-based CORDIC multiplier with an operand word-length of 16 bits.
A similar analysis plots the reduction in throughput for a reduced MAPE for a serial
CORDIC multiplier. The results are shown in Fig. 15.
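
The MAPE measurement described above can be sketched in a few lines of Python. This is a floating-point behavioral model that assumes the standard linear-mode CORDIC recurrence with shifts of 2^-1, 2^-2, ...; it is not the authors' hardware design. The function names, the operand range [0.1, 1.0) and the random seed are illustrative assumptions, so the values it prints are not expected to reproduce the MAPE figures reported in the paper.

```python
import random


def cordic_multiply(x, z, iterations):
    """Approximate x * z with linear-mode CORDIC (assumes |z| <= 1)."""
    y = 0.0
    for i in range(1, iterations + 1):
        d = 1.0 if z >= 0 else -1.0   # direction that drives the residual z toward zero
        step = 2.0 ** -i              # corresponds to a right shift by i bits in hardware
        y += d * x * step             # accumulate a shifted copy of x
        z -= d * step                 # part of the multiplier not yet applied
    return y


def mape(iterations, samples=350, seed=0):
    """Mean absolute percent error over randomly chosen operand pairs."""
    rng = random.Random(seed)         # hypothetical test set, fixed seed for repeatability
    total = 0.0
    for _ in range(samples):
        x = rng.uniform(0.1, 1.0)
        z = rng.uniform(0.1, 1.0)
        exact = x * z
        total += abs(cordic_multiply(x, z, iterations) - exact) / abs(exact)
    return 100.0 * total / samples


if __name__ == "__main__":
    for n in (8, 12, 16):
        print(n, mape(n))
```

In this model the residual after n iterations is bounded by 2^-n, so the absolute error of the product is at most |x| * 2^-n. This is one way to see why the error falls off exponentially with the number of stages or iterations.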

[Plot: x-axis: No. of Stages/Iterations (7 to 17); y-axis: MAPE (0.00 to 6.75)]

Fig. 13 Variation in MAPE with number of CORDIC stages/iterations


[Plot: x-axis: MAPE (0.00 to 6.75); y-axis: No. of LUTs (150 to 400)]

Fig. 14 Variation in LUT utilization with MAPE

[Plot: x-axis: MAPE (0.00 to 6.75); y-axis: Throughput (MHz) (42 to 84)]

Fig. 15 Variation in throughput with MAPE



5. Conclusions and Future scope


This work proposed a novel implementation of fixed-point multiplication based on
the CORDIC algorithm. Different architectures were realized on FPGA platforms. Detailed
analysis revealed that a speed-up in performance is indeed achievable using CORDIC-based
multipliers, although this speed-up comes at the cost of reduced accuracy. A trade-off
between performance and accuracy will therefore determine the proper choice of CORDIC
architecture for a particular application. Our analysis also reveals that the performance
speed-up obtained with CORDIC-based multipliers carries over to more complex
operations such as convolution. Our future work will focus on using the CORDIC algorithm
to perform the multiply-accumulate operation, which can serve as a building block for
efficient high-performance filter structures.

Acknowledgments
This work was carried out under the seed grant initiative of the TEQIP-III project. The authors
are grateful to the TEQIP-III project team of IUST for their assistance and financial support
during the entire course of the study.
References
1. G. L. Narayan and B. Venkataramani, Optimization Techniques for FPGA based Wave
Pipelined DSP Blocks, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 13 (2005), 783-792.
2. M. A. Ashour and H. I. Saleh, An FPGA Implementation guide for some different types of
Serial-Parallel Multiplier Structures, Microelectronics Journal, 31 (2000), 161-168.
3. K. Compton, S. Hauck, Reconfigurable Computing: A survey of Systems and Software, ACM
Computing Surveys, 34 (2002), 171-210.
4. IEEE Standard for Binary Floating-Point Arithmetic, IEEE Standards Board, (2018).
5. Technical Report ANSI/IEEE Std. 754-1985, the Institute of Electrical and Electronics
Engineers (1985).
6. C. Inacio, D. Ombres, The DSP decision: Fixed point or floating? IEEE Spectrum, 33 (1996),
72-74.
7. R. Tessier and W. Burleson, Reconfigurable Computing for DSP: A Survey, Journal of VLSI
Signal Processing, 28 (2001), 7-27.
8. S. Hauck and A. Dehon, Reconfigurable Computing: The Theory and Practice of FPGA-based
Computing, (Morgan Kaufmann series, 2008).
9. D. H. Timmerman, B. J. Hosticka, G. Schmidt, A Programmable CORDIC chip for Digital
Signal Processing Applications, IEEE Journal of Solid State Circuits, 26 (1991), 1317-1321.
10. W. H. Chen, C. H. Smith, S. C. Fralick, A fast Computational Algorithm for the Discrete Cosine
Transform, IEEE Transactions on Communications, 25 (1977), 1004-1009.
11. A. M. Despain, Very Fast Fourier Transform Algorithms for Hardware Implementation, IEEE
Transactions on Computers, 28 (1979), 333-341.
12. A. M. Despain, Fourier Transform Computers using CORDIC Iterations, IEEE Transactions on
Computers, 23, (1974), 993-1001.
13. B. Haller, J. Gotze, J. Cavallaro, Efficient Implementation of Rotation Operations for high-
performance QRD-RLS filtering, Proceedings of the International Conference on Application
Specific Systems, Architectures and Processors, (1997).
14. J. R. Cavallaro, F. T. Luk, CORDIC Arithmetic for an SVD Processor, Journal of Parallel and
Distributed Computing, (1988), 271-290.
15. T. J. Todman, G. A. Constantinides, S. J. E. Wilton, O. Mencer, W. Luk and P. Y. K. Cheung,
Reconfigurable Computing: Architecture and Design Methods, IEE Proceedings - Computers and
Digital Techniques, 152 (2005), 193-207.
16. R. Naseer, M. Balakrishnan, and A. Kumar, Direct Mapping of RTL Structures onto LUT-Based
FPGAs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17
(1998), 624-631.
17. O. Kwon, K. Nowka, and E. E. Swartzlander Jr., A 16-bit by 16-bit MAC design using fast 5:3
compressor cells, Journal of VLSI Signal Processing, 31 (2002), 77-89.
18. S. Bhattacharjee, S. Sil, B. Basak and A. Chakarbarti, Evaluation of Power Efficient Adder and
Multiplier Circuits for FPGA based DSP Applications, Proceedings of the International
Conference on Communication and Industrial Applications (ICCIA), (2011).
19. B. Khurshid and R. Naaz, Technology Optimized Fixed-Point Bit-Parallel Multiplier for LUT
based FPGAs, International Journal of High Performance Systems Architecture, 6 (2016), 28-
35.
20. K. Kumar, V. Tyagi, H. Kukreja, S. Thakral and M. Verma, A State-of-the-Art Study on
Multipliers: Advancement and Comparison, IIOAB Journal, 09 (2018), 54-66.
21. S. Srikanth, I. T. Banu, G. V. Priya and G. Usha, Low power array multiplier using modified
full adder, IEEE International Conference on Engineering and Technology (ICETECH),
Coimbatore, (2016), 1041-1044.
22. S. K. Sahoo and C. Shekhar, Delay optimized array multiplier for signal and image processing,
International Conference on Image Information Processing, Shimla, (2011), 1-4.
23. D. Okamoto, M. Kondo, T. Yokogawa, Y. Sejima, K. Arimoto and Y. Sato, A Serial Booth
Multiplier Using Ring Oscillator, Fourth International Symposium on Computing and
Networking (CANDAR), Hiroshima, (2016), 458-461.
24. R. Shrestha and U. Rastogi, Design and Implementation of Area-Efficient and Low-Power
Configurable Booth-Multiplier, 29th International Conference on VLSI Design and 2016 15th
International Conference on Embedded Systems (VLSID), Kolkata, (2016), 599-600.
25. M. V. Durga Pavan and S. R. Ramesh, An Efficient Booth Multiplier Using Probabilistic
Approach, International Conference on Communication and Signal Processing (ICCSP),
Chennai, (2018), 365-368.
26. R. B. S. Kesava, B. L. Rao, K. B. Sindhuri and N. U. Kumar, Low Power and Area Efficient
Wallace Tree Multiplier using Carry Select Adder with Binary to Excess-1 Converter,
Conference on Advances in Signal Processing (CASP), Pune, (2016), 248-253.
27. D. Paradhasaradhi, M. Prashanthi and N. Vivek, Modified Wallace Tree Multiplier using
Efficient Square Root Carry Select Adder, International Conference on Green Computing
Communication and Electrical Engineering (ICGCCEE), Coimbatore, (2014), 1-5.
28. S. Asif and Y. Kong, Design of an Algorithmic Wallace Multiplier using High Speed Counters,
Tenth International Conference on Computer Engineering & Systems (ICCES), Cairo, (2015),
133-138.
29. R. Suryavanshi and S. Khare, An Efficient High-Performance Vedic Multiplier: Review,
International Journal of Advanced Engineering and Management, IJOAEM, 2, (2017), 60-64.
30. J. S. Edle and P. R. Deshmukh, Application Specific Architecture of 32-Bit Vedic Multiplier,
International Conference on Computing, Communication, Control and Automation
(ICCUBEA), Pune, (2017), 1-6.
31. D. K. Kahar and H. Mehta, High Speed Vedic Multiplier used Vedic Mathematics, International
Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, (2017), 356-
359.
32. B. N. K. Reddy, Design and Implementation of high performance and area efficient square
architecture using Vedic Mathematics, Analog Integrated Circuits and Signal Processing,
Springer, 102, (2020), 501-506.
33. A. K. Joginipelly and D. Charalampidis, Efficient separable convolution using field
programmable gate arrays, Microprocessors and Microsystems, Elsevier, 71, (2019), 8-15.
34. J. E. Volder, The CORDIC Trigonometric Computing Technique, IRE Trans. Electronic
Computers, 8 (1959), 330-334.
35. D. Ercegovac, T. Lang, Digital Arithmetic, (Morgan Kaufmann, 2004).
