A Novel Area-Power Efficient Design For Approximated Small-Point FFT Architecture

4816 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 39, NO.
12, DECEMBER 2020
Xueyu Han, Jiajia Chen, Boyu Qin, and Susanto Rahardja, Fellow, IEEE

Abstract—Fast Fourier transform (FFT) is an essential algorithm in digital signal processing and advanced mobile communications. With the continuous development of modern technology, the area-power efficient hardware implementation of FFT has attracted a lot of attention. In this article, a novel design for FFT implementation is proposed. The number of resource-expensive multiplications in our design is decreased by a twiddle factor merging technique that reduces the hardware area. Subsequently, a common subexpression sharing scheme is applied to reuse the hardware resources to further save the hardware area. In addition, a magnitude-response aware approximation algorithm is proposed for applications where the transformation accuracy can be slightly compromised in exchange for lower hardware area and power dissipation. Logic synthesis shows that the proposed 16-point FFT architecture can save hardware area and power dissipation on application-specific integrated circuit (ASIC) by up to 65.7% and 53.1% compared with recently published designs. Similarly, the proposed 32-point FFT architecture achieves up to 58.8% reduction on hardware area and 60.0% reduction on power dissipation on ASIC.

Index Terms—Approximated fast Fourier transform (FFT), common subexpression sharing (CSSharing), twiddle factor merging (TFMerging).

Manuscript received October 16, 2019; revised January 8, 2020; accepted February 29, 2020. Date of publication March 6, 2020; date of current version November 20, 2020. This work was supported by the Nanjing University of Aeronautics and Astronautics, Nanjing, China, under Grant 56YAH18043. This article was recommended by Associate Editor R. Drechsler. (Corresponding authors: Jiajia Chen; Susanto Rahardja.)
Xueyu Han and Susanto Rahardja are with the School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China (e-mail: susantorahardja@ieee.org).
Jiajia Chen and Boyu Qin are with the College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China (e-mail: jiajia_chen@nuaa.edu.cn).
Digital Object Identifier 10.1109/TCAD.2020.2978839

I. INTRODUCTION

Fast Fourier transform (FFT) is a widely used algorithm in digital signal processing [1]–[3] and wireless communication systems [4]–[8]. In long term evolution (LTE) and its high-speed versions LTE-Advanced/LTE-Advanced Pro [7], FFTs with different transformation sizes are desired. Moreover, FFT generates the required radio frequency (RF) beams for multibeam beamforming, which is one of the significant techniques in the fifth generation (5G) wireless communication [8]. Nowadays, as most mobile communication systems are designed to be implemented on portable devices, the embedded FFT processor is required to have low hardware area and power dissipation. It is also vital that the computation speed of the FFT processor is high enough to support the high data rate requirement. Therefore, area-power efficient and high-speed FFT architectures have been actively researched in recent years [5], [9]–[11].

FFT can be implemented both in high-level software languages and on hardware. The software solution has the advantage in flexibility, but it is usually accompanied by a bulky and complex hardware system which is unfriendly to small-sized and battery-based devices [12], [13]. Motivated by the challenging requirements in hardware area and power budget, many hardware solutions for FFT implementation have been developed for mobile and smart devices [14], [15].

As technology matures, the application-specific integrated circuit (ASIC) has become more cost effective for some generic classes of problems in digital signal processing such as the FFT [16]. An ASIC solution provides the ability to manage power and exploit the robustness after processing in digital domain.

Existing hardware solutions for FFT implementation can be mainly divided into reconfigurable and fixed architectures. Reconfigurable architectures [17]–[19] are generally developed for variable-length FFTs. Mixed-radix algorithms are common solutions to implementing a variety of transformation lengths. Fixed FFTs can be further categorized into pipelined and parallel architectures. The most classical approaches for pipelined FFTs [20]–[24] are multipath delay commutator and single-path delay feedback. The basic structure of these two approaches consists of a processing element (PE) for data computation and the required memory for data storage. Parallel architectures, on the other hand [25]–[29], can be constructed by decomposing the FFT algorithm into several partitions and using a combination of PEs to compute these partitions in parallel. Different FFT architectures have different advantages and disadvantages in terms of hardware complexity and computation speed. Pipelined FFTs have simpler architectures but larger latency as data are processed sequentially. In addition, a complex controller is required in pipelined architectures. On the contrary, parallel architectures can deal with N inputs simultaneously and have the advantage of easy control, but at the expense of more hardware resources.

It is challenging to implement FFT on hardware to achieve high accuracy and low hardware cost. However, the accuracy can be relaxed in some applications to achieve a substantial reduction in hardware cost and power dissipation. To do this, approximation [30] has become imperative for the tradeoff. Recently published works in [24], [27], and [28] improved existing FFT architectures based on approximate computing. An inexact pipelined FFT accelerator was proposed in [24]. A statistical learning technique based on normalized least mean
square algorithm is used to train the inexact FFTs, where separate training and test sets are requisite. Han et al. [27] proposed an efficient parallel FFT architecture where the infinite-precision coefficients are approximated by cutting off their insignificant bits. However, it may not be the most hardware efficient solution when the error introduced is still less than the transformation error allowed. Another approximated parallel FFT architecture was recently proposed in [28] where the design was simplified to be multiplierless. Although a low area-power design is achieved in [28], the transformation error caused by its heavy approximation is relatively high.

In this article, a novel design for area-power efficient FFT architecture is proposed as a solution to address the above challenges. First, a new twiddle factor merging (TFMerging) technique is proposed where the low-cost unified adder–subtractor (UAS) operator [31] is utilized for the merged expressions. Second, duplicated multiplications in the merged expressions are implemented by one sharable structure and a dedicated common subexpression (CS) sharing algorithm is proposed to maximize the sharing. Third, we propose a novel approximation approach for the FFT coefficients under different requirements of transformation error. The approach approximates the FFT coefficients while taking into account both the bit sensitivity to the transformation error and the reduction in hardware cost. The proposed new approach has two advantages. The first one is the ability to generate a more efficient approximation solution by utilizing the margin between the maximum error allowed and the error caused by the direct truncation. The second one is the ability to achieve more subexpression sharing by generating more approximated FFT coefficient candidates. They contribute to the design of efficient FFT without the need of extra training and test sets. In addition, the approximation approach can be applied with any given transformation error constraints to provide a solution with good quality. The proposed FFT architectures are compared with state-of-the-art designs and the logic synthesis results show that our designs successfully reduce the hardware area and power dissipation.

This article is organized as follows. Section II introduces some prerequisite preliminaries for FFT algorithm and presents the problem formulation. Section III presents the proposed algorithm for FFT architecture design. The logic synthesis results of our architectures and competing designs are given in Section IV. Finally, a conclusion is given in Section V.

II. PRELIMINARIES AND PROBLEM FORMULATION

A. DFT and FFT Preliminaries

The 1-D DFT of an input signal sequence x(n) is given by

$$X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad n = 0, 1, \ldots, N-1 \tag{1}$$

where X(k) is the frequency domain representation of x(n) with k = 0, 1, ..., N − 1. $W_N^{nk} = \exp(-j2\pi nk/N)$ is the twiddle factor. Since $N^2$ complex multiplications (CMs) and N(N − 1) complex additions (CAs) are required in (1), the computational complexity of an N-point DFT grows quickly with increasing N. In order to compute DFT more efficiently, FFT was proposed. The simplest and most commonly used is Cooley–Tukey radix-2 decimation-in-time (DIT) FFT [32]. Radix-2 DIT-FFT decomposes an N-point DFT into smaller DFTs by dividing the even- and odd-indexed input samples into two parts as

$$X(k) = X_1(k) + W_N^{k} X_2(k) \tag{2}$$

where $X_1(k) = \sum_{n=0}^{N/2-1} x(2n) W_{N/2}^{nk}$ and $X_2(k) = \sum_{n=0}^{N/2-1} x(2n+1) W_{N/2}^{nk}$ are two N/2-point DFTs. By recursively applying (2), an N-point DFT can be decomposed into $\log_2 N$ stages as shown in Fig. 1 and each stage contains a total of N/2 2-point DFTs. The butterfly block depicted in Fig. 2 is utilized to implement these 2-point DFTs based on the symmetric property of the twiddle factors

$$W_N^{nk + N/2} = -W_N^{nk}. \tag{3}$$

In each butterfly block, one CM, one CA, and one complex subtraction are required. Therefore, the computational complexity of an N-point FFT decreases remarkably from $O(N^2)$ to $O(N \log_2 N)$ compared with the direct DFT.

Fig. 1. 16-point DIT-FFT algorithm.

Fig. 2. Butterfly block.

B. Problem Formulation

According to (1) and (2), CM and CA are the two arithmetic operations required in DFT/FFT. The numbers of CMs and CAs needed in N-point DFT and DIT-FFT architectures are given in the second and third columns of Table I. In hardware implementation, CMs and CAs are converted to
operations on their real and imaginary parts. The numbers of real multiplications (RMs) and real additions (RAs) required in N-point DFT and DIT-FFT architectures are computed and given in the fourth and fifth columns, respectively, in Table I.

TABLE I
REQUIRED RESOURCES IN N-POINT DFT AND DIT-FFT ARCHITECTURES

In many previous works, RMs are implemented by shift and add network to achieve a low hardware cost [33], [34]. Knowing that the shifts can be implemented by direct hardwiring, addition is the core operation that consumes look-up tables (LUTs) in FPGA. Therefore, the total number of RAs can be used to estimate the required hardware resources. To have a more accurate estimation, we assume all RAs are implemented with ripple carry adders (RCAs) whose complexity is proportional to their wordlength [35], [36]. By doing so, the number of full adders (FAs) in the carry chain of RCA can be used as the cost of each particular RA. For implementation on ASIC, although there are many different types of FA cell, the number of FAs needed gives valid information about the hardware complexity. Therefore, the total FA count is used as an estimation for the hardware cost of the FFT architecture, which is denoted as final_cost.

Besides the hardware cost, transformation accuracy is another major consideration. Given an input signal sequence x = [x(0), x(1), ..., x(N − 1)], the exact FFT outputs X = [X(0), X(1), ..., X(N − 1)] are computed by (2). To implement FFT on hardware, infinite-precision twiddle factors in (2) have to be approximated to finite-precision. When approximated twiddle factors are used in (2), the outputs corresponding to the given input x can be denoted as $X_{appro} = [X_{appro}(0), X_{appro}(1), \ldots, X_{appro}(N-1)]$. The corresponding error E caused by the approximation is evaluated based on the root-mean-square error (RMSE) between $X_{appro}$ and X as follows:

$$E = \sqrt{\frac{\sum_{k=0}^{N-1} \left| X(k) - X_{appro}(k) \right|^2}{N}}. \tag{4}$$

With the estimated hardware cost final_cost and the computed error E, the design problem can then be formulated as a minimization of final_cost provided that E does not exceed the required bound

$$\text{Minimize}\{final\_cost\}, \quad \text{s.t. } E \le \delta \tag{5}$$

where δ is the maximum RMSE allowed. To solve (5), we propose an algorithm to perform TFMerging which reduces the number of multiplications in FFT architecture and increases hardware resource sharing to reduce final_cost. Moreover, a magnitude-response aware approximation approach is proposed to further reduce final_cost while E is monitored to be no more than δ. The details of the algorithm are presented in the next section.

III. PROPOSED METHOD

A. Twiddle Factor Merging

In an FFT architecture, multiplication is the most resource-expensive arithmetic operation. To minimize the FFT hardware cost with a limited transformation error, we propose a TFMerging method to reduce the number of multiplications. Twiddle factor multiplications (TFMs) in FFT architecture can be divided into two categories: 1) trivial TFM and 2) nontrivial TFM. Trivial TFMs refer to the multiplications with twiddle factors which can be implemented by direct hardwiring, such as multiplying the input with −1, 1, or −j. Nontrivial TFMs are multiplications which need to be implemented by shift and add network. Only those nontrivial TFMs are actually resource-expensive multiplications of which the number needs to be reduced. Based on the conventional radix-2 DIT-FFT architecture, we merge two nontrivial twiddle factors (NTFs) $W_N^{s_p}$ and $W_N^{s_{p+1}}$, in two adjacent stages p and p + 1, into a new NTF, as shown in Fig. 3. r, t, u, and v are four intermediate results in the FFT architecture. R is computed as

$$\begin{aligned} R &= r + W_N^{s_p} t + W_N^{s_{p+1}} \left( u + W_N^{s_p} v \right) \\ &= r + W_N^{s_p} t + W_N^{s_{p+1}} u + W_N^{s_{p+1}} W_N^{s_p} v. \end{aligned} \tag{6}$$

Fig. 3. Multiplication of NTFs across stages in an N-point radix-2 DIT-FFT architecture.

Using Euler’s rule to re-express the twiddle factor, we have

$$W_N^{nk} = \exp\left(-j\frac{2\pi nk}{N}\right) = \cos\frac{2\pi nk}{N} - j \sin\frac{2\pi nk}{N} \tag{7}$$

where cos(2πnk/N) and sin(2πnk/N) are twiddle factor coefficients (TFCs). The term $W_N^{s_{p+1}} W_N^{s_p}$ in (6) can then be represented by a single NTF

$$W_N^{s_p} W_N^{s_{p+1}} = \cos\frac{2\pi (s_p + s_{p+1})}{N} - j \sin\frac{2\pi (s_p + s_{p+1})}{N} = W_N^{s_p + s_{p+1}}. \tag{8}$$

Therefore, each TFM in (6) is decomposed into two multiplications with nontrivial TFCs. Similar to nontrivial TFMs, nontrivial TFCs are coefficients which need to be implemented by shift and add network. When $s_{p+1} = N/4 - (s_p + s_{p+1})$, (6) is rewritten as

$$\begin{aligned} R = r &+ \left[ \cos\frac{2\pi s_p}{N}\, t + \cos\frac{2\pi s_{p+1}}{N}\, u + \sin\frac{2\pi s_{p+1}}{N}\, v \right] \\ &- j \left[ \sin\frac{2\pi s_p}{N}\, t + \sin\frac{2\pi s_{p+1}}{N}\, u + \cos\frac{2\pi s_{p+1}}{N}\, v \right]. \end{aligned} \tag{9}$$

Benefiting from the distributive operation of multiplication in (6) and the TFMerging in (8), the computation of the FFT
output is finally re-expressed by the sum of terms where the input is multiplied with nontrivial TFCs like (9). In such case, it is highly possible that these addition terms in (9) can be shared and the number of multiplications with nontrivial TFCs is reduced correspondingly compared to the conventional radix-2 DIT-FFT architecture. For example, in stages 3 and 4 of Fig. 1, the data path in red for computing X(1) corresponds to the data path for computing R which is also marked in red in Fig. 3. The parameters in Fig. 3 are specified as p = 3, $W_N^{s_p} = W_{16}^{2}$, and $W_N^{s_{p+1}} = W_{16}^{1}$. r, t, u, and v are computed as

$$\begin{aligned} r &= (x(0) - x(8)) - j(x(4) - x(12)) \\ t &= (x(2) - x(10)) - j(x(6) - x(14)) \\ u &= (x(1) - x(9)) - j(x(5) - x(13)) \\ v &= (x(3) - x(11)) - j(x(7) - x(15)). \end{aligned}$$

By using (9), the computation of X(1) is finally expressed as

$$\begin{aligned} X(1) = {} & (x(0) - x(8)) \\ & + \sin\frac{2\pi}{8}\,[(x(2) - x(10)) - (x(6) - x(14))] \\ & - \sin\frac{\pi}{8}\,[(x(5) - x(13)) - (x(3) - x(11))] \\ & + \cos\frac{\pi}{8}\,[(x(1) - x(9)) - (x(7) - x(15))] \\ & - j\Big\{ (x(4) - x(12)) \\ & \quad + \sin\frac{2\pi}{8}\,[(x(2) - x(10)) + (x(6) - x(14))] \\ & \quad + \cos\frac{\pi}{8}\,[(x(5) - x(13)) + (x(3) - x(11))] \\ & \quad + \sin\frac{\pi}{8}\,[(x(1) - x(9)) + (x(7) - x(15))] \Big\}. \end{aligned}$$

Similar to the computation of X(1), other outputs with k = 0, 2, 3, ..., 15 can be computed by using the proposed TFMerging method. Only six multiplications with nontrivial TFCs are required for computing X(1) using the proposed TFMerging method, while eight multiplications are required for the nontrivial TFCs (i.e., 3 complex nontrivial TFMs) in Fig. 1. In addition, among the six multiplications for computing X(1), two multiplications with sin(2π/8) are common to X(3), X(5), and X(7), and the remaining four multiplications with sin(π/8) and cos(π/8) are common to X(7). Structures for implementing these common terms can be reused in the FFT architecture and therefore a large amount of hardware resources can be saved.

A basic structure for implementing the above common terms is the addition/subtraction of two input signals. We define the elementary addition/subtraction term as the computation of the sum/difference of two input signals. Let us take the 16-point FFT as an example. By recursively using (2), the 16-point FFT is expressed as

$$\begin{aligned} X(k) = {} & \big[ x(0) + W_2^k x(8) \big] + W_4^k \big[ x(4) + W_2^k x(12) \big] \\ & + W_8^k \Big\{ \big[ x(2) + W_2^k x(10) \big] + W_4^k \big[ x(6) + W_2^k x(14) \big] \Big\} \\ & + W_{16}^k \Big\{ \big[ x(1) + W_2^k x(9) \big] + W_4^k \big[ x(5) + W_2^k x(13) \big] \\ & \qquad + W_8^k \Big( \big[ x(3) + W_2^k x(11) \big] + W_4^k \big[ x(7) + W_2^k x(15) \big] \Big) \Big\} \end{aligned}$$

where k indexes the kth frequency component of the input x. When k is even, i.e., $W_2^k = 1$, elementary addition terms in X(k) are x(0) + x(8), x(1) + x(9), and so on. When k is odd (i.e., $W_2^k = -1$), elementary subtraction terms include x(0) − x(8), x(1) − x(9), and so on. Since the TFMerging technique is applied to nontrivial TFCs, elementary addition/subtraction terms remain unchanged after merging. Because the hardware area for implementing a unified adder-subtractor [31] operator is lower than the total hardware areas of one adder and one subtractor, we utilize the UAS to replace the adders and subtractors in the first stage of the proposed FFT architecture.

B. Common Subexpression Sharing

Hardware resource sharing also exists when different nontrivial TFCs are multiplied with the same input, as shown in (9). In such case, multiplications of the input with these nontrivial TFCs can be implemented simultaneously using a merged structure. In this article, we transform all TFCs into canonical signed digit (CSD) representation [37]. An M-bit CSD representation $c_{M-1} c_{M-2} \cdots c_0$ of a decimal number C is derived by

$$C = \sum_{i=0}^{M-1} c_i 2^i \tag{10}$$

where $c_i \in \{\bar{1}, 0, 1\}$ and $\bar{1}$ denotes −1. In CSD representation, a weight-two subexpression is defined as a string of 0s starting and ending with nonzero digits. If one subexpression appears more than once, it is called CS. Structures to implement CS in different TFCs can be merged into a single structure where the hardware resources are reused. Therefore, we propose a CS sharing (CSSharing) algorithm which is summarized in Algorithm 1. In an N-point FFT, we denote nontrivial TFCs which are multiplied with the same input as a set C. The algorithm starts with generating all CSs that can be shared and stores them into a set CS_Share using the function Find_CS. The function Size counts the total number of CSs in CS_Share. For each CS, the function Remove is used to remove it from C before Find_CS finds whether there is another CS that can be further shared. With the selected CSs, the hardware cost of the corresponding FFT architecture is evaluated by the function FA_count. At last, the CSs resulting in a minimum hardware cost are chosen by the function Min_cost and returned by the algorithm.

Algorithm 1 Pseudo Code of the CSSharing Algorithm
Input: C
Output: CS_Final
CS_Share = Find_CS(C);
n = Size(CS_Share);
for i from 1 to n
    CS_Selected = CS_Share[i];
    updated_C = Remove(CS_Share[i], C);
    CS_Additional = Find_CS(updated_C);
    CS_Selected = Insert(CS_Selected, CS_Additional);
    cost[i] = FA_count(CS_Selected);
end for
CS_Final = Min_cost(cost);
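Returning to the merged expression for X(1) in Section III-A, the identity can be checked numerically. The following is a minimal sketch, not the authors' implementation: it evaluates X(1) of a 16-point transform with only the three real coefficients sin(2π/8), sin(π/8), and cos(π/8), each used twice (six real multiplications in total), and compares the result with a direct DFT. The function names `dft` and `merged_x1` are hypothetical and chosen only for this illustration.

```python
import cmath
import math


def dft(x):
    """Direct N-point DFT, X(k) = sum_n x(n) * W_N^{nk}; used only as a reference."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def merged_x1(x):
    """X(1) of a 16-point FFT written with the merged twiddle factor coefficients."""
    s4 = math.sin(2 * math.pi / 8)  # sin(2*pi/8), equal to cos(2*pi/8)
    s8 = math.sin(math.pi / 8)
    c8 = math.cos(math.pi / 8)
    # Elementary subtraction terms d[n] = x(n) - x(n + 8), n = 0..7.
    d = [x[n] - x[n + 8] for n in range(8)]
    # "real" and "imag" are the real and imaginary parts only for real-valued
    # input; for complex input the same linear combination still holds.
    real = d[0] + s4 * (d[2] - d[6]) - s8 * (d[5] - d[3]) + c8 * (d[1] - d[7])
    imag = d[4] + s4 * (d[2] + d[6]) + c8 * (d[5] + d[3]) + s8 * (d[1] + d[7])
    return real - 1j * imag
```

For any 16-sample input, `merged_x1(x)` should agree with `dft(x)[1]` up to floating-point rounding, which is a quick way to confirm the signs and index pairs in the expanded expression.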
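The CSD representation in (10) and the search for shareable weight-two subexpressions can be illustrated with a short sketch. It rests on several assumptions that are not taken from the article: coefficients are scaled to integers by truncation to 8 fractional bits, a weight-two pattern is identified by the distance between two neighbouring nonzero digits and the product of their signs, and overlapping candidates are simply counted rather than resolved. The helper names `to_csd`, `weight_two_patterns`, and `common_subexpressions` are hypothetical and only approximate the role of Find_CS in Algorithm 1.

```python
import math
from collections import Counter


def to_csd(value):
    """Return the CSD digits (-1, 0, 1) of an integer, least significant first."""
    digits = []
    while value != 0:
        digit = 0
        if value & 1:
            # +1 when value % 4 == 1, -1 when value % 4 == 3, so that the
            # next digit is guaranteed to be zero (no adjacent nonzero digits).
            digit = 2 - (value & 3)
            value -= digit
        digits.append(digit)
        value >>= 1
    return digits


def weight_two_patterns(digits):
    """Yield (distance, sign product) for each pair of neighbouring nonzero digits."""
    nonzero = [(i, d) for i, d in enumerate(digits) if d != 0]
    for (i0, d0), (i1, d1) in zip(nonzero, nonzero[1:]):
        yield (i1 - i0, d0 * d1)


def common_subexpressions(coefficients):
    """Count candidate patterns over all coefficients; keep those seen more than once."""
    counts = Counter()
    for coefficient in coefficients:
        counts.update(weight_two_patterns(to_csd(coefficient)))
    return {pattern: n for pattern, n in counts.items() if n > 1}


# Example: the three nontrivial TFCs of the 16-point FFT, truncated to
# 8 fractional bits (the wordlength is an assumption for illustration).
tfcs = [int(c * 256) for c in (math.sin(math.pi / 4), math.sin(math.pi / 8), math.cos(math.pi / 8))]
shared = common_subexpressions(tfcs)
```

A full CSSharing implementation would additionally have to decide which of the overlapping candidates to keep and evaluate the FA count of each choice, as Algorithm 1 does with Remove, Insert, and FA_count.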
C. Magnitude-Response Aware Approximation to Twiddle Factor Coefficients

All the infinite-precision TFCs are first transformed into M-bit CSD representations by cutting off the insignificant bits of TFCs. If the precision of TFCs is specified as K-bit (K ≤ M), the truncation operation directly cuts off the (M − K) least significant bits (LSBs). The error $E^*$ caused by the truncation operation can be evaluated by (4). When a maximum allowed transformation error δ is specified for a certain application, there is usually a small margin between $E^*$ and δ. To better utilize the margin, the truncated TFCs can be further approximated by changing some digits in the TFCs to reduce total FA count as long as the error is not bigger than δ. The challenge is how to develop an efficient measure to minimize the total FA count during the approximation while the error is being kept below δ. To address this, a novel magnitude-response aware approximation approach is proposed in this section.

Two methods are considered in the proposed approximation approach. The first one is to change less significant nonzero digits of nontrivial TFCs to zero so that the number of adders used for implementing the corresponding TFMs is reduced. The second method is to change less significant nonzero digits to their complement so that opportunities for sharing CSs between nontrivial TFCs are created. These two methods are simultaneously applied to different nonzero digits of TFCs which have different impacts on the total FA count and transformation error. The effect on the total FA count by approximating the ith TFC at the jth nonzero digit counting from the most significant bit is

$$c_{i,j} = \frac{FA_{be} - FA_{af}}{FA_{be}} \tag{11}$$

where $FA_{be}$ and $FA_{af}$ are the total FA count of the corresponding FFT implementation before and after one approximation method is adopted, respectively. By evaluating the transformation error using (4), the sensitivity of the jth nonzero digit in the ith TFC with respect to transformation error is defined as

$$s_{i,j} = \frac{E_{af} - E_{be}}{E_{be}} \tag{12}$$

where $E_{be}$ and $E_{af}$ are the transformation errors before and after the nonzero digit is changed, respectively. Changing a nonzero digit with a larger c and a smaller s leads to more effective improvement. To evaluate these two measures by using one metric, we define the gain of changing the jth nonzero digit in the ith TFC on the total FA count and transformation error as

$$g_{i,j} = \frac{c_{i,j}}{s_{i,j}}. \tag{13}$$

Changing the nonzero digit with a larger gain contributes to a more efficient solution. Since the approximation to one nonzero digit can be done by either changing to zero or the complement, the respective gains can be denoted as $g_{zero}^{i,j}$ and $g_{comp}^{i,j}$ which are evaluated by (13). With the above denotations and definitions, two approximated algorithms are presented in the next sections.

1) Approximation Based on Existing Common Subexpressions: If CSs exist in different nontrivial TFCs, we propose an approximated FFT algorithm based on existing CSs (named as AFFT_ECS) in this article. First of all, an optimal CSSharing solution is generated by the proposed algorithm. With the optimal CSs unchanged, the AFFT_ECS algorithm iteratively changes some of the remaining nonzero digits in TFCs which do not exist in any CS yet. In each iteration, $g_{zero}^{i,j}$ and $g_{comp}^{i,j}$ are computed for all nonzero digits by (13). The approximation is performed to the nonzero digit which has the biggest $g_{zero}^{i,j}$/$g_{comp}^{i,j}$, provided that the corresponding transformation error $E_{af}$ is less than the maximally allowed error δ. If $E_{af}$ is bigger than δ due to the change of the nonzero digit with the biggest gain, the algorithm seeks the digit with the second biggest gain. The approximation is performed when the error caused by changing the digit with the second biggest gain is smaller than δ. Otherwise, the algorithm continues to seek the digit with the third biggest gain. The iteration continues while $E_{af}$ is always evaluated to decide if any nonzero digit should be changed. If $E_{af}$ caused by the changing of the digit with the least gain is still bigger than δ, the algorithm stops and the approximation to this set of TFCs completes. The main steps of the AFFT_ECS are summarized in Algorithm 2.

The function error computes the error of the FFT implementation using the approximated TFC set C. The function Count_nonzerodigit counts the total number of nonzero digits of updated_C after removing existing CSs. For each nonzero digit in updated_C, the function Gain measures its gain. The nonzero digit that has the biggest gain is selected and the TFC set is changed by the function Max_Gain accordingly. The function Rank_Gain ranks the gains in descending order. Finally, the algorithm returns the TFC set where all the qualified nonzero digits are approximated.

Algorithm 2 Pseudo Code of the AFFT_ECS Algorithm
Input: C, existing_CS, δ
Output: C
E = error(C);
while (E ≤ δ) {
    updated_C = Remove(existing_CS, C);
    D = Count_nonzerodigit(updated_C);
    n = Size(D);
    for i from 1 to n
        (gcomp[i], gzero[i]) = Gain(updated_C[i]);
    end for
    final_C = Max_Gain(gcomp, gzero, C);
    E = error(final_C);
    if E ≤ δ
        C = final_C;
    else
        C_set = Rank_Gain(gcomp, gzero, C);
        for j from 2 to n
            E = error(C_set[j]);
            if E ≤ δ
                C = C_set[j];
                break;
            end if
        end for
    end if }

2) Approximation by Creating New Common Subexpressions: For small size FFTs, the number of nontrivial TFCs is limited. As a consequence, it is likely that no CS
can be shared by TFCs at the beginning. Moreover, even if CSs exist initially, fixing them as in AFFT_ECS algorithm may hinder the TFCs from being further approximated to achieve a better solution. For example, if the existing CS is located at less significant bit position in a TFC, we can only change the remaining nonzero digits at more significant bit positions, which causes bigger transformation error. In the above-mentioned two circumstances, we propose to approximate TFCs with the freedom of creating new CS by changing a nonzero digit to its complement. $E_{af}$ caused by this approximation operation may exceed the maximally allowed error δ. However, this does not mean that the approximation cannot be performed because the unacceptable error can be compensated by changing other nonzero digits in the same TFCs. To achieve this, we propose an error compensation technique to adapt TFCs for compensating the error before it is compared with δ. The algorithm starts with computing the initial transformation error using the TFCs in which one nonzero digit is changed to create new CS. For each of the remaining nontrivial digits not appearing in the new CS, it is changed to zero and the $E_{af}$ is recomputed correspondingly. The minimum $E_{af}$ is selected and compared with the initial transformation error after the new CS is generated. If the error decreases, the algorithm moves on to change the next nonzero digit to zero until the transformation error stops decreasing. The main steps of the algorithm are summarized in Algorithm 3. The function Change_to_zero changes a nonzero digit which does not exist in CS and returns a new TFC set compensated_C. The function Min_error selects the minimum error and returns the approximated TFC set which produces this error.

Algorithm 3 Pseudo Code of the Error Compensation Technique
Input: C
Output: C, Einitial
Einitial = error(C);
while (true) {
    D = Count_nonzerodigit(C);
    n = Size(D);
    for i from 1 to n
        compensated_C[i] = Change_to_zero(C[i]);
        E[i] = error(compensated_C[i]);
    end for
    (min_E, new_C) = Min_error(E, compensated_C);
    if min_E ≤ Einitial
        Einitial = min_E;
        C = new_C;
    else
        break;
    end if }

With the error compensation, we propose an approximated FFT algorithm by creating new CS (named as AFFT_NCS). If there are CSs existing in TFCs initially, they are ignored and all nonzero digits are considered equally when creating new CS. For each nonzero digit in TFCs, the algorithm first changes it to its complement. All the remaining nonzero digits in the same TFC take turns to be examined. Once a new CS is found, it is fixed and the algorithm stops creating more. This is because of the limited number of nonzero digits existing in TFC. When one CS is fixed, there is little chance that the remaining nonzero digits can form other CSs. The error compensation is applied when the corresponding transformation error exceeds δ. After that, the AFFT_ECS algorithm proposed in Section III-C1 is performed for further approximation.

This process is applied to every nonzero digit in the same way as described above. The total FA count is computed for each implementation and the approximated TFC set which results in the lowest FA count is returned as the final solution by the AFFT_NCS. The main steps of the algorithm are summarized in Algorithm 4. The function Change_to_complement changes a particular nonzero digit to its complement and returns the approximated TFC set. All the CSs are saved in CS_set using the function Generate_CS. The function Select_CS selects all CSs that can be shared and saves them into sharable_CS. For each element in sharable_CS, the function Shortest_CS chooses the CS of the shortest length as the newly created CS for further TFC approximation.

Algorithm 4 Pseudo Code of the AFFT_NCS Algorithm
Input: C, δ
Output: final_C
D = Count_nonzerodigit(C);
n = Size(D);
for i from 1 to n
    sharable_CS = ∅;
    appro_C = ∅;
    new_C = Change_to_complement(C[i]);
    CS_set = Generate_CS(new_C);
    sharable_CS = Select_CS(CS_set);
    while (sharable_CS ≠ ∅) {
        new_CS = Shortest_CS(sharable_CS);
        Einitial = error(new_C);
        if Einitial ≥ δ
            (adapt_C, Einitial) = Error_Compensate(new_CS, new_C);
        end if
        if Einitial ≤ δ
            appro_C[i] = AFFT_ECS(adapt_C, new_CS, δ);
            cost[i] = FA_count(new_CS);
        end if }
end for
if appro_C ≠ ∅
    (min_cost, final_C) = Min_cost(cost);
else
    final_C = C;
end if

With the proposed TFMerging technique, CSSharing and magnitude-response aware approximation algorithm, a complete approximated FFT architecture design algorithm is established. First of all, the proposed TFMerging technique is applied to an N-point FFT to generate nontrivial TFCs to be approximated. Next, we apply the CSSharing method and magnitude-response aware approximation algorithm to further reduce the hardware complexity, with the maximally allowed transformation error δ. With the nontrivial TFCs, we first check if there are existing CSs that can be shared. If not, the AFFT_NCS algorithm is applied to approximate TFCs. Otherwise, the CSSharing method is applied to provide a solution for resource sharing before the AFFT_ECS algorithm is applied to further approximate nontrivial TFCs. Though a good solution can be returned by the AFFT_ECS algorithm, the fixed CSs create a barrier for further approximation. Therefore, the AFFT_NCS algorithm is also applied in this situation to provide an alternative even though CS exists initially. Two
Algorithm 5 Pseudo Code of the TFC_Approximation Algorithm
Input: coefficients, δ
Output: final_C, final_cost
Initialize final_cost = ∞;
C = TF_merging(coefficients);
minimum_wordlength = Initialize_TFC(C, δ);
for w = minimum_wordlength ++
    truncated_C = Truncate(C, w);
    existing_CS = Find_CS(truncated_C);
    if existing_CS ≠ ∅
        fixed_CS = CSSharing(truncated_C);
        approximated_CE = AFFT_ECS(truncated_C, fixed_CS, δ);
        approximated_CN = AFFT_NCS(truncated_C, δ);
        cost_ECS = FA_count(approximated_CE);
        cost_NCS = FA_count(approximated_CN);
        if cost_ECS < cost_NCS
            approximated_C = approximated_CE;
        else
            approximated_C = approximated_CN;
        end if
    else
        approximated_C = AFFT_NCS(truncated_C, δ);
    end if
    cost = FA_count(approximated_C);
    if cost > final_cost
        break;
    else
        final_cost = cost;
        final_C = approximated_C;
    end if
end for
solutions are compared at last in terms of total FA count

of the corresponding architectures. The final solution is the
one which is with the minimum cost. The main steps of the
overall algorithm (named as TFC_approximation) are sum-
marized in Algorithm 5. The function TF_merging performs
the TFMerging as presented in Section III-A and returns the
nontrivial TFCs. The function Initialize_TFC determines the
initial precision of nontrivial TFCs by truncating insignif-
Fig. 4. Flow chart of the proposed TFC_approximation algorithm.
icant bits with the constraint on transformation error. The
minimum wordlength of TFCs which makes the error to be
lower than δ is selected as initial precision. However, longer TABLE II
Q UANTITATIVE I MPROVEMENT OF E ACH P ROPOSED M ETHOD
TFC wordlength than the initial length does not necessarily
cause higher FA count because longer lengths can provide
more opportunities for such approximation. Therefore, after
applying the initial precision, the coefficient wordlength is
increased until further increment can no longer contribute to
FA count reduction. The function Find_CS searches CS in the
truncated TFCs.
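As an illustration only, the outer wordlength search of Algorithm 5 can be sketched in Python as follows. The names `truncate` and `wordlength_search`, the `w_max` safety bound, and the caller-supplied `approximate` and `cost_fn` callbacks are assumptions introduced for this sketch. They stand in for Truncate, the AFFT_ECS/AFFT_NCS step, and FA_count, whose internals are not reproduced here, and the cost table in the demo is hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Coeffs = List[float]


def truncate(coeffs: Sequence[float], w: int) -> Coeffs:
    """Round each coefficient to w fractional bits (toy stand-in for Truncate)."""
    scale = 1 << w
    return [round(c * scale) / scale for c in coeffs]


def wordlength_search(
    coeffs: Sequence[float],
    w_min: int,
    approximate: Callable[[Coeffs], Coeffs],
    cost_fn: Callable[[Coeffs, int], float],
    w_max: int = 24,
) -> Tuple[Optional[Coeffs], float, Optional[int]]:
    """Grow the wordlength from w_min until the cost starts to rise.

    Follows the loop of Algorithm 5: a candidate is kept while its cost does
    not exceed the best cost so far, and the search stops at the first increase.
    w_max is an assumed bound that the pseudocode does not have.
    """
    best: Optional[Coeffs] = None
    best_cost = math.inf          # final_cost starts at infinity
    best_w: Optional[int] = None
    for w in range(w_min, w_max + 1):
        candidate = approximate(truncate(coeffs, w))
        cost = cost_fn(candidate, w)
        if cost > best_cost:      # a longer word no longer helps: stop
            break
        best, best_cost, best_w = candidate, cost, w
    return best, best_cost, best_w


if __name__ == "__main__":
    # Hypothetical cost table, used only to exercise the control flow.
    toy_costs = {4: 10, 5: 8, 6: 8, 7: 9, 8: 1}
    result = wordlength_search(
        [math.cos(math.pi / 8), math.sin(math.pi / 8)],
        w_min=4,
        approximate=lambda c: c,            # no approximation in this toy run
        cost_fn=lambda c, w: toy_costs[w],
        w_max=8,
    )
    print(result)
```

With the toy cost table, the search stops at w = 7 because the cost rises from 8 to 9, and it returns the w = 6 candidate. It never reaches the cheaper w = 8 entry, which shows that this stopping rule is greedy rather than exhaustive.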
With the proposed algorithm, we generate the approximated TFCs and the minimized cost for an FFT implementation. The problem formulated in (5) is then solved. To clearly show the entire algorithm, the flow chart of the proposed TFC_approximation algorithm is presented in Fig. 4.

IV. LOGIC SYNTHESIS RESULTS AND DISCUSSION

In this section, 16- and 32-point approximated FFT architectures are designed using the proposed algorithm. They are compared with two recently published works and the results by logic synthesis are presented and discussed.

A. Design Example

We present the design flow of 16-point FFT architectures to demonstrate the proposed algorithm. Since the proposed algorithm needs the maximum error allowed to run, we assume four specific transformation error requirements. The first one, δ1 = 5.3e−5, is the transformation error of one competing design in [14], where the precision of TFCs is 10-bit such that the FFT implementation is virtually exact. If the maximum error allowed is relaxed to δ2 = 1.9e−3 and δ3 = 1.9e−2, there is more freedom to approximate TFCs to achieve lower FA cost. The last one, δ4 = 1.7e−1, is the transformation error of another competing design proposed in [28]. The TFCs in [28] were approximated heavily so that the corresponding error is much higher. Therefore, we set δ4 as the biggest error requirement in our experiment. For ease of comparing the four transformation errors, we assume a unit impulse input in time domain [as shown in Fig. 5(a)] and transform it into its frequency domain representations. The frequency magnitude responses of exact FFT and approximated FFTs with the maximum error allowed being 5.3e−5, 1.9e−3, 1.9e−2, and 1.7e−1 are shown in Fig. 5(b)–(f), respectively. The four transformation errors demonstrate examples of virtually exact, moderately and excessively approximated FFTs. Because the algorithmic flow is the same for different errors, we take δ3 = 1.9e−2 as one example. The first step is to perform the TFMerging. All the expressions for the computation of FFT outputs are re-expressed by merging nontrivial TFCs. The second step is approximation. All the nontrivial TFCs are processed by the proposed approach under the constraint that the transformation error is no more than 1.9e−2. The algorithm terminates when the wordlength of TFCs is 7-bit along with the usage of CSSharing scheme and the final total FA count is 1033. Table II shows the quantitative improvement of each proposed method in our algorithm. The design of [14] is chosen as a competing method. We first apply the TFMerging technique to a radix-2 DIT-FFT architecture which has the same transformation error as the design of [14]. The FA count saving by the merging technique is 472. Next, we apply the magnitude-response aware approximation (TFC_approximation) algorithm to nontrivial TFCs together with the CSSharing, which reduces the FA count further by 179. The reason why we apply TFC_approximation and CSSharing together is that the objective of approximation is to create more opportunity for sharing, and the benefit of the approximation must be realized by the subsequent sharing. From Table II, it is obvious that the TFMerging technique brings significant improvement to FA count reduction. Moreover, the proposed TFC_approximation algorithm creates opportunities for reusing hardware which can be achieved by the CSSharing. An additional 179 FA count is saved consequently.

With the approximated TFCs returned by Approx algorithm, the CS-shared structure for implementing cos(π/8) and sin(π/8) is shown in Fig. 6. The 16-point approximated FFT architecture for this design example is depicted in Fig. 7. The UAS operator performs addition and subtraction of two input signals simultaneously. The RIC block combines real and imaginary parts generated by the proposed architecture and lists all the final outputs for 16-point FFT. It is obvious that only 12 multiplications with nontrivial TFCs (one shared structure performs two multiplications at a time) are required in this design. This is much less than that of Fig. 1 where 40 multiplications with nontrivial TFCs are involved.

Fig. 5. (a) Unit impulse input in time domain. (b) Frequency magnitude responses of the input by exact FFT. (c) Frequency magnitude responses of the input by approximated FFT (transformation error is 5.3e−5). (d) Frequency magnitude responses of the input by approximated FFT (transformation error is 1.9e−3). (e) Frequency magnitude responses of the input by approximated FFT (transformation error is 1.9e−2). (f) Frequency magnitude responses of the input by approximated FFT (transformation error is 1.7e−1).

Fig. 6. CS-shared structure when the maximum error allowed is 1.9e−2.

TABLE III
TOTAL FA COUNT AND TRANSFORMATION ERROR OF THE FFT DESIGNS

B. Results and Discussion

The above four 16-point approximated FFT architectures by our algorithm are compared with three state-of-the-art works proposed in [5], [14], and [28]. Since the FFT processor in [14] is multiradix and supports a number of transformation sizes, we extract the 16-point FFT design by using its radix-16 core. The extracted 16-point FFT core only involves the hardware necessary to compute a 16-point FFT. Redundant resources which would cause unfair comparison are not included. Similarly, 32-point FFT architectures are also designed using the proposed algorithm and compared with [14]. As the design in [28] is for 16-point approximated FFT implementation only, we do not compare our 32-point architectures with [28]. The total FA count (#FAs) and the error of the FFT designs are given in Table III. Among
the proposed FFT designs, AFFT1 and AFFT5 have the same transformation errors as the 16- and 32-point designs in [14], respectively, while the others are less accurate. To compare the hardware cost more accurately, all designs are described in Verilog HDL and mapped to Xilinx Virtex7, xc7s75fgga484 FPGA device. Xilinx Vivado Design Suite v17.4 is used to synthesize the designs. The number of LUTs (#LUTs), the utilization density of LUTs, the number of IOs (#IOs), the utilization density of IOs and delays in ns of 16- and 32-point FFT designs are shown in Table IV. At least 41.2% and 56.4% improvements are achieved by our 16- and 32-point designs, respectively, over the designs of [5] and [14] in terms of #LUTs. Our designs have shorter delays compared with [5] and [14]. The reason is that the merging technique reduces the number of multiplications and therefore #FAs in the critical path is reduced. The design of [5] has larger delays since the iteration operation in the CORDIC scheme leads to more adders in the critical path. The reason why AFFT4 and AFFT8 reduce #LUTs dramatically over the 16- and 32-point designs in [5] and [14] is that their high transformation error tolerance causes excessively approximated TFCs, with which all the multiplications in the FFT architecture can be implemented by direct hardwiring. Compared with another excessively approximated 16-point FFT design proposed in [28], AFFT4 saves 8.4% FPGA area, benefitting from the proposed techniques. The FPGA areas of the 16- and 32-point approximated FFT designs are plotted in Fig. 8. The FPGA areas of the 16- and 32-point FFTs designed by using the conventional radix-2 DIT-FFT algorithm are used as the baseline. All other areas of FFT architectures by our algorithm, [5], [14], and [28] are normalized by the baseline.

Fig. 7. Approximated 16-point FFT architecture.

TABLE IV
COMPARISON BETWEEN THE FFT DESIGNS ON FPGA. (a) 16-POINT FFT DESIGNS. (b) 32-POINT FFT DESIGNS

The on-chip memory requirement of the designs is presented in Table V. The proposed FFT implementations have lower register cost, because the proposed merging technique reduces the number of temporary TFM partial products which require storage.

To verify the performance on ASIC, all the designs are also mapped to 45-nm standard cell library and synthesized by Synopsys Design Compiler. We choose the same cell library
and synthesize all the designs using the same version of Design Compiler to make sure that the same FA cells are used to implement different designs for fair comparison. The synthesized areas in μm² and delays in ns of the FFT designs are shown in Table VI. The normalized ASIC areas of FFT architectures designed by the proposed algorithm, [5], [14], and [28] are plotted in Fig. 9. Similarly, the areas of the designs by radix-2 DIT-FFT algorithm are used as the baseline. From the comparison, AFFT1 and AFFT5 save ASIC area by up to 65.7% and 58.8% compared with the 16- and 32-point designs in [5] and [14], respectively. The result is consistent with the hardware cost reduction on FPGA device.

In addition to hardware area and delay, total power dissipation, which includes dynamic power and static power, is another critical metric to evaluate hardware performance [38]. All approximated FFT designs are first implemented on the FPGA device and synthesized by Xilinx Vivado Design Suite v17.4 to obtain their power dissipations. The same output rate of 25 MHz and supply voltage of 1.0 V are set for all experiments to have a fair comparison. Moreover, designs are mapped to standard cell library for ASIC and simulated by Synopsys Power Compiler, version J-2014.09-SP3. An output rate of 25 MHz and a supply voltage of 1.0 V are used. The power dissipations in mW of all designs simulated on both FPGA device and ASIC are listed in Table VII. To present the results visually, they are plotted in Figs. 10 and 11, respectively, for 16- and 32-point FFTs, where the powers of designs by radix-2 DIT-FFT algorithm are used as the baseline. Our AFFT1 and AFFT5 save ASIC power by up to 53.1% and 60.0% compared with the 16- and 32-point designs in [5] and [14], respectively. The power dissipation of AFFT4 on ASIC outperforms the design in [28] by 5.3%.

Fig. 8. Normalized FPGA areas of 16- and 32-point FFT designs.

Fig. 9. Normalized ASIC areas of 16- and 32-point FFT designs.

Fig. 10. Normalized power dissipations of 16-point FFT designs.

TABLE V
ON-CHIP MEMORY REQUIREMENT OF THE FFT DESIGNS

TABLE VI
COMPARISON BETWEEN THE FFT DESIGNS ON ASIC

TABLE VII
POWER DISSIPATIONS IN mW FOR FPGA AND ASIC IMPLEMENTATION

For an N-point DIT-FFT architecture (N ≥ 32), the nontrivial TFCs exist in (log₂ N − 2) stages. The TFMerging technique can be applied to more stages as N increases. We have more chances to reduce the number of multiplications in large-size FFTs because far more TFCs exist than in small-size FFTs. The magnitude-response aware approximation algorithm can be applied in the same way because the error evaluation and the approximations to reduce FA cost are not limited to small-size FFTs. Additionally, large-size FFTs can also be decomposed into small-size FFTs and
therefore be implemented by recursively adopting the small-size FFT cores as presented in this article. In conclusion, large-size FFTs can benefit from the proposed methods in the same way as small-size FFTs, and at least the same or even better savings on hardware cost and power dissipation can be achieved.

Fig. 11. Normalized power dissipations of 32-point FFT designs.

V. CONCLUSION

A new hardware area-power efficient approximated FFT design is presented in this article. The proposed TFMerging technique and magnitude-response aware approximation algorithm provide an efficient solution to approximate TFCs for deriving a new architecture for N-point FFT implementation. Both 16- and 32-point FFT architectures are designed by applying the proposed algorithm. In ASIC implementation using 45-nm standard cell library, our 16- and 32-point designs save area by up to 65.7% and 58.8%, respectively, over relevant designs published recently. Meanwhile, the power simulation results show that the 16- and 32-point FFT architectures designed by our algorithm save power dissipation by up to 53.1% and 60.0%, respectively, compared with recently published solutions.

REFERENCES

[1] S.-N. Tang and F.-C. Jan, “Energy-efficient and calibration-aware Fourier-domain OCT imaging processor,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 6, pp. 1390–1403, Jun. 2019.
[2] A. Chen and X. Wang, “An image watermarking scheme based on DWT and DFT,” in Proc. Int. Conf. Multimedia Image Process., Wuhan, China, Mar. 2017, pp. 177–180.
[3] A. Wahbi, A. Roukhe, and L. Hlou, “Enhancing the quality of voice communications by acoustic noise cancellation (ANC) using a low cost adaptive algorithm based fast Fourier transform (FFT) and circular convolution,” in Proc. Int. Conf. Intell. Syst. Theor. Appl., Rabat, Morocco, May 2014, pp. 218–224.
[4] N. Bhagat, D. Valencia, A. Alimohammad, and F. Harris, “High-throughput and compact FFT architecture using the Good-Thomas and Winograd algorithms,” IET Commun., vol. 12, no. 8, pp. 1011–1018, May 2018.
[5] S. Liu and D. Liu, “A high-flexible low-latency memory-based FFT processor for 4G, WLAN, and future 5G,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 3, pp. 511–523, Mar. 2019.
[6] J. Bhattacharya et al., “Implementation of OFDM modulator and demodulator subsystems using 16 point FFT/IFFT pipeline architecture in FPGA,” in Proc. IEEE Annu. Inf. Technol. Electron. Mobile Commun. Conf., Vancouver, BC, Canada, Oct. 2017, pp. 295–300.
[7] 4G LTE Networks. (2018). LTE Advanced Pro to 5G Roadmap. [Online]. Available: https://www.4g-lte.net/lte/lte-advanced-pro-to-5g-roadmap/
[8] W. Hong et al., “Multibeam antenna technologies for 5G wireless communications,” IEEE Trans. Antennas Propag., vol. 65, no. 12, pp. 6231–6249, Dec. 2017.
[9] S. L. M. Hassan, N. Sulaiman, and I. S. A. Halim, “Low power pipelined FFT processor architecture on FPGA,” in Proc. IEEE Control Syst. Grad. Res. Colloquium, Shah Alam, Malaysia, Aug. 2018, pp. 31–34.
[10] Z. Qian and M. Margala, “Low-power split-radix FFT processors using radix-2 butterfly units,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 9, pp. 3008–3012, Sep. 2016.
[11] B. V. Uma, H. R. Kamash, S. Mohith, V. Sreekar, and S. Bhagirath, “Area and time optimized realization of 16 point FFT and IFFT blocks by using IEEE 754 single precision complex floating point adder and multiplier,” in Proc. Int. Conf. Soft Comput. Techn. Implement., Faridabad, India, Oct. 2015, pp. 99–104.
[12] X. Chen, Y. Lei, Z. Lu, and S. Chen, “A variable-size FFT hardware accelerator based on matrix transposition,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 10, pp. 1953–1966, Oct. 2018.
[13] N. Govil and S. R. Chowdhury, “High performance and low cost implementation of fast Fourier transform algorithm based on hardware software co-design,” in Proc. IEEE REGION 10 Symp., Kuala Lumpur, Malaysia, Apr. 2014, pp. 403–407.
[14] J. Chen, J. Hu, S. Lee, and G. E. Sobelman, “Hardware efficient mixed radix-25/16/9 FFT for LTE systems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 2, pp. 221–229, Feb. 2015.
[15] V. Ariyarathna et al., “Analog approximate-FFT 8/16-beam algorithms, architectures and CMOS circuits for 5G beamforming MIMO transceivers,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 3, pp. 466–479, May 2018.
[16] A. Changela, M. Zaveri, and A. Lakhlani, “ASIC implementation of high performance radix-8 CORDIC algorithm,” in Proc. Int. Conf. Adv. Comput. Commun. Informat., Dec. 2018, pp. 699–705.
[17] X.-Y. Shih, H.-R. Chou, and Y.-Q. Liu, “VLSI design and implementation of reconfigurable 46-mode combined-radix-based FFT hardware architecture for 3GPP-LTE applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 1, pp. 118–129, Jul. 2017.
[18] H. Xiao, X. Yin, X. Chen, J. Li, and X. Chen, “VLSI design of low-cost and high-precision fixed-point reconfigurable FFT processors,” IET Comput. Digit. Techn., vol. 12, no. 3, pp. 105–110, May 2018.
[19] J. Chen and J. Ding, “New algorithm for design of low complexity twiddle factor multipliers in radix-2 FFT,” in Proc. IEEE Int. Symp. Circuits Syst., Lisbon, Portugal, May 2015, pp. 958–961.
[20] N. L. Ba and T. T.-H. Kim, “An area efficient 1024-point low power radix-2² FFT processor with feed-forward multiple delay commutators,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 10, pp. 3291–3299, Oct. 2018.
[21] S.-W. Yang and J.-Y. Lee, “Constant twiddle factor multiplier sharing in multipath delay feedback parallel pipelined FFT processors,” IEEE Electron. Lett., vol. 50, no. 15, pp. 1050–1052, Jul. 2014.
[22] M. Garrido, J. Grajal, M. A. Sanchez, and O. Gustafsson, “Pipelined radix-2^k feedforward FFT architectures,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 1, pp. 23–32, Jan. 2011.
[23] M. Bansal and S. Nakhate, “High speed pipelined 64-point FFT processor based on radix-2² for wireless LAN,” in Proc. Int. Conf. Signal Process. Integr. Netw., Noida, India, Feb. 2017, pp. 607–612.
[24] A. Lingamneni, C. Enz, K. Palem, and C. Piguet, “Highly energy-efficient and quality-tunable inexact FFT accelerators,” in Proc. IEEE Custom Integr. Circuits Conf., San Jose, CA, USA, Sep. 2014, pp. 1–4.
[25] Q.-J. Xing, Z.-G. Ma, and Y.-K. Xu, “A novel conflict-free parallel memory access scheme for FFT processors,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 64, no. 11, pp. 1347–1351, Nov. 2017.
[26] H. K. Samudrala, S. Qadeer, S. Azeemuddin, and Z. Khan, “Parallel and pipelined VLSI implementation of the new radix-2 DIT FFT algorithm,” in Proc. IEEE Int. Symp. Smart Electron. Syst., Hyderabad, India, Dec. 2018, pp. 21–26.
[27] X. Han, J. Chen, and S. Rahardja, “A new twiddle factor merging method for low complexity and high speed FFT architecture,” in Proc. IEEE Int. Circuit Syst. Symp., Kuala Lumpur, Malaysia, Sep. 2019, pp. 1–4.
[28] V. Ariyarathna et al., “Multibeam digital array receiver using a 16-point multiplierless DFT approximation,” IEEE Trans. Antennas Propag., vol. 67, no. 2, pp. 925–933, Feb. 2019.
[29] Y. Ji-Yang, H. Dan, L. Xin, X. Ke, and W. Lu-Yuan, “Conflict-free architecture for multi-butterfly parallel processing in-place radix-r FFT,” in Proc. IEEE Int. Conf. Signal Process., Chengdu, China, Nov. 2016, pp. 496–501.
[30] S. Mittal, “A survey of techniques for approximate computing,” ACM Comput. Surveys, vol. 48, no. 4, pp. 1–34, Mar. 2016.
[31] J. Ding, J. Chen, and C.-H. Chang, “A new paradigm of common subexpression elimination by unification of addition and subtraction,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 35, no. 10, pp. 1605–1617, Oct. 2016.
[32] J. W. Cooley and J. W. Tukey, “An algorithm for machine calculation of complex Fourier series,” Math. Comput., vol. 19, no. 90, pp. 297–301, Jan. 1965.
[33] J. Chen and C.-H. Chang, “High-level synthesis algorithm for the design of reconfigurable constant multiplier,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 12, pp. 1844–1856, Dec. 2009.
[34] K. Moller, M. Kumm, M. Garrido, and P. Zipf, “Optimal shift reassignment in reconfigurable constant multiplication circuits,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 3, pp. 710–714, Mar. 2018.
[35] B. Koyada, N. Meghana, M. O. Jaleel, and P. R. Jeripotula, “A comparative study on adders,” in Proc. Int. Conf. Wireless Commun. Signal Process. Netw., Chennai, India, Mar. 2017, pp. 2226–2230.
[36] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, “Low-power digital signal processing using approximated adders,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp. 124–137, Jan. 2013.
[37] R. Kaur and T. Singh, “Design of 32-point mixed radix FFT processor using CSD multiplier,” in Proc. Int. Conf. Parallel Distrib. Grid Comput., Waknaghat, India, Dec. 2016, pp. 538–543.
[38] J. Chen, C. H. Chang, and H. Qian, “New power index model for switching power analysis from adder graph of FIR filter,” in Proc. IEEE Int. Symp. Circuits Syst., Taipei, Taiwan, May 2009, pp. 2197–2200.

Xueyu Han received the B.Eng. degree from Northwestern Polytechnical University, Xi’an, China, in 2017, where she is currently pursuing the Ph.D. degree with the Center of Intelligent Acoustics and Immersive Communications.
Her research interests include algorithms and circuit design for digital signal processing.

Jiajia Chen received the B.Eng. (Hons.) and Ph.D. degrees from Nanyang Technological University, Singapore, in 2004 and 2010, respectively.
From April 2012 to March 2018, he was a Faculty Member with the Singapore University of Technology and Design, Singapore. Since April 2018, he has been with the Nanjing University of Aeronautics and Astronautics, Nanjing, China, where he is currently a Professor. His research interest includes computational transformations of low-complexity digital circuits and digital signal processing.
Prof. Chen served as the Web Chair of the Asia–Pacific Computer Systems Architecture Conference in 2005, the Technical Program Committee Member of European Signal Processing Conference in 2014, and the Third IEEE International Conference on Multimedia Big Data in 2017, and has been serving as an Associate Editor for the EURASIP Journal on Embedded Systems (Springer) since 2016.

Boyu Qin is currently pursuing the B.Eng. degree with the College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China.
His research interest includes digital circuits design and implementation.

Susanto Rahardja (Fellow, IEEE) received the B.Eng. degree from the National University of Singapore, Singapore, and the M.Eng. and Ph.D. degrees in electronic engineering from Nanyang Technological University, Singapore.
He is currently the Chair Professor with Northwestern Polytechnical University, Xi’an, China, under the Thousand Talent Plan of People’s Republic of China. He attended the Stanford Executive Programme with the Graduate School of Business, Stanford University, Stanford, CA, USA. He contributed to the development of a series of audio compression technologies, such as Audio Video Standards AVS-L, AVS-2 and ISO/IEC 14496-3:2005/Amd.2:2006, and ISO/IEC 14496-3:2005/Amd.3:2006 in which some have been licensed to several companies. He has more than 15 years of experience in leading research team for media related research that cover areas in signal processing (audio coding, video/image processing), media analysis (text/speech, image, video), media security (biometrics, computer vision, and surveillance), and sensor networks. He has published more than 300 papers and has been granted more than 70 patents worldwide out of which 15 are U.S. patents. His research interests are in multimedia, signal processing, wireless communications, discrete transforms, machine learning, and signal processing algorithms and implementation.
Dr. Rahardja was a recipient of several honors, including the IEE Hartree Premium Award, the Tan Kah Kee Young Inventors’ Open Category Gold Award, the Singapore National Technology Award, the A*STAR Most Inspiring Mentor Award, the Finalist of the 2010 World Technology and Summit Award, the Nokia Foundation Visiting Professor Award, and the ACM Recognition of Service Award. He was an Associate Editor of the IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING and the IEEE TRANSACTIONS ON MULTIMEDIA, and the Senior Editor of the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING. He is currently serving as an Associate Editor for the Journal of Visual Communication and Image Representation (Elsevier) and the IEEE TRANSACTIONS ON MULTIMEDIA. He was the Conference Chair of fifth ACM SIGGRAPHASIA in 2012 and APSIPA second Summit and Conference in 2010 and 2018, as well as other conferences in ACM, SPIE, and IEEE.
