# FPGA implementation of Discrete Fractional Fourier Transform

M.V.N.V.Prasad†1, K. C. Ray†2 and A. S. Dhar‡
†Department of Electronics and Communication Engineering, Indian Institute of Information Technology, Allahabad, Uttar Pradesh, India. Email: 1mvnvprasad@gmail.com, 2kcr@iiita.ac.in

‡Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, West Bengal, India. Email: asd@ece.iitkgp.ernet.in

Abstract– For decades, the fractional Fourier transform has attracted considerable attention for applications in signal and image processing. With the development of the fractional Fourier transform and its discrete form, real-time computation of the discrete fractional Fourier transform has become essential in those applications. In this context, we propose a new hardware architecture for implementing the Discrete Fractional Fourier Transform (DFrFT) with a hardware complexity of O(4N), where N is the transform order. The proposed architecture has been simulated and synthesized using Verilog HDL, targeting an FPGA device (XLV5LX110T). The simulation results are very close to those obtained using MATLAB. The results show that this architecture can operate at a maximum frequency of 217 MHz.

Keywords– Discrete Fractional Fourier Transform, Hardware Architecture, CORDIC and FPGA.

I. INTRODUCTION

The fractional Fourier transform [1], [2], [3] has emerged as a mathematical tool with a wide range of signal [4] and image processing applications, such as biomedical signal detection [6], image registration [7], image encryption [5], security of fingerprint registration data [8], broadband beamforming of LFM signals [9], and moving target detection and location in spaceborne SAR. Unlike the Discrete Fourier Transform (DFT), the Discrete Fractional Fourier Transform (DFrFT) has many definitions, such as the direct form, improved sampling type, linear combination type, eigenvector decomposition type [10], group theory type and impulse train type. Among these definitions, the eigenvector decomposition type is regarded as a legitimate definition [11] because it satisfies all the required properties: unitarity, index additivity, reduction to the DFT when the fractional order is one, and approximation of the continuous fractional Fourier transform. To the authors' knowledge, no hardware architecture other than [12] is available for real-time implementation of the DFrFT. In this paper, a new hardware architecture for implementing the DFrFT based on eigenvector decomposition is proposed and implemented on an FPGA device for real-time applications.

The rest of this paper is organized as follows. Section II presents a brief review of the fractional Fourier transform and its discrete form. Section III describes the proposed hardware architecture. Results and discussion of the proposed implementation are given in Section IV. Finally, Section V concludes the paper with the future scope of this work.

II. FRACTIONAL FOURIER TRANSFORM

A. Continuous Fractional Fourier Transform

The fractional Fourier transform, a generalization of the Fourier transform, rotates the signal f(u) in the time-frequency plane [1] by an angle α = aπ/2, where a is the fractional order. It is given by

$$
f_\alpha(v) = \int_{-\infty}^{\infty} f(u)\, K_\alpha(u,v)\, du \tag{1}
$$

where

$$
K_\alpha(u,v) =
\begin{cases}
\sqrt{\dfrac{1 - j\cot\alpha}{2\pi}}\; \exp\!\left( j\,\dfrac{u^2+v^2}{2}\cot\alpha - j\,u v \csc\alpha \right) & \text{if } \alpha \text{ is not a multiple of } \pi \\[2ex]
\delta(u-v) & \text{if } \alpha \text{ is a multiple of } 2\pi \\[1ex]
\delta(u+v) & \text{if } \alpha+\pi \text{ is a multiple of } 2\pi
\end{cases}
$$

Here v is the variable in the a-th order fractional domain and u is the variable in the zeroth-order fractional domain. The kernel $K_\alpha(u,v)$ can be decomposed, as given in equation (2), in terms of the Hermite-Gaussian functions [2], which are eigenfunctions of the Fourier transform. The decomposed kernel is

$$
K_\alpha(u,v) = \sum_{k=0}^{\infty} \psi_k(v)\, e^{-j\alpha k}\, \psi_k(u) \tag{2}
$$

and

$$
\psi_k(u) = \frac{2^{1/4}}{\sqrt{2^k k!}}\, H_k\!\left(\sqrt{2\pi}\,u\right) e^{-\pi u^2}
$$

where $\psi_k(u)$ is the kth-order Hermite-Gaussian function and $H_k$ is the kth-order Hermite polynomial.
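As a numerical sanity check on the Hermite-Gaussian functions used in equation (2), the following minimal sketch evaluates $\psi_k(u)$ with the standard three-term Hermite recurrence and approximates the inner products by a Riemann sum. The function names, the integration window and the step count are illustrative assumptions, not part of the proposed architecture.

```python
import math

def hermite(k, x):
    """Physicists' Hermite polynomial H_k(x) via H_{n+1} = 2x*H_n - 2n*H_{n-1}."""
    h_prev, h = 1.0, 2.0 * x
    if k == 0:
        return h_prev
    for n in range(1, k):
        h_prev, h = h, 2.0 * x * h - 2.0 * n * h_prev
    return h

def psi(k, u):
    """k-th order Hermite-Gaussian function as written below equation (2)."""
    norm = 2.0 ** 0.25 / math.sqrt(2.0 ** k * math.factorial(k))
    return norm * hermite(k, math.sqrt(2.0 * math.pi) * u) * math.exp(-math.pi * u * u)

def inner_product(j, k, half_width=6.0, steps=6000):
    """Riemann-sum approximation of the integral of psi_j(u)*psi_k(u) over u.

    half_width and steps are assumed values; the Gaussian envelope makes
    the truncated tails negligible for small orders.
    """
    du = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        u = -half_width + i * du
        total += psi(j, u) * psi(k, u)
    return total * du
```

Under these assumptions, `inner_product(k, k)` should be close to one and `inner_product(j, k)` close to zero for j ≠ k, which is the orthonormality that the eigen decomposition relies on.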

B. Discrete Fractional Fourier Transform

The discrete fractional Fourier transform was proposed in [10] using discrete Hermite-Gaussian functions. For N points it is given in equation (3).

$$
F^\alpha[m,n] = \sum_{k=0}^{N-1} u_k[m]\, e^{-j\alpha k}\, u_k[n] \tag{3}
$$

where $u_k[n]$ is the kth discrete Hermite-Gaussian function. The discrete values of the continuous Hermite-Gaussian function $\psi_k(v)$ are approximated using the eigenvectors of the commuting matrix S in [10]. The N-point DFrFT matrix for rotation angle α is defined [3] as

$$
F^\alpha = \sum_{\substack{k=0 \\ k \neq N \ (N\ \text{odd}) \\ k \neq N-1 \ (N\ \text{even})}}^{N} u_k[n]\, e^{-j\alpha k}\, u_k^T[n] = U E U^T \tag{4}
$$

where U is the discrete Hermite-Gaussian matrix, whose columns are the discrete Hermite-Gaussian functions:

$$
U = \begin{bmatrix}
u_0[1] & u_1[1] & \cdots & u_{N-2}[1] & u_M[1] \\
u_0[2] & u_1[2] & \cdots & u_{N-2}[2] & u_M[2] \\
\vdots & \vdots & & \vdots & \vdots \\
u_0[N] & u_1[N] & \cdots & u_{N-2}[N] & u_M[N]
\end{bmatrix},
\qquad
M = \begin{cases} N-1 & \text{for } N \text{ odd} \\ N & \text{for } N \text{ even} \end{cases} \tag{5}
$$

and E is a diagonal matrix whose diagonal elements are the eigenvalues $e^{-j0\alpha}, e^{-j1\alpha}, e^{-j2\alpha}, \ldots, e^{-j(N-2)\alpha}, e^{-jM\alpha}$ of the DFrFT matrix $F^\alpha$. The response $f_\alpha[n]$ of an N-point DFrFT to N input samples f[n] with rotation angle α can be calculated as $f_\alpha[n] = F^\alpha f[n]$, i.e. $f_{\alpha,N\times1} = U_{N\times N} * (E_{N\times N} * (U^T_{N\times N} * f_{N\times1}))$. Here * denotes matrix multiplication. In the proposed architecture the matrix E is replaced by a column matrix C that contains the eigenvalues of the DFrFT for the given input angle α, and the middle matrix multiplication is replaced by an array (element-by-element) multiplication. The modified expression is $f_{\alpha,N\times1} = U_{N\times N} * (C_{N\times1} \times (U^T_{N\times N} * f_{N\times1}))$, where × denotes array multiplication.

III. PROPOSAL OF DFRFT ARCHITECTURE

The proposed architecture is composed of three levels. The input data to be processed flow through the three serially connected levels, as shown in Fig. 1.

Fig. 1: Block diagram of the DFrFT. (Labels recovered from the figure: rotation angle α, input f, blocks C, U1, E and U2 across Level-I, Level-II and Level-III, rotated input $f_\alpha$.)

Level-I performs two mathematical operations: the calculation of the eigenvalues for the given input rotation angle, and the calculation of the response of matrix $U^T$ to the input samples f. These two operations are carried out by two blocks of Level-I, named C and U1. This level passes the two computed results, the matrix C and $U^T*f$, to Level-II, which multiplies the eigenvalues with the response of the U1 block and feeds the product $C\times(U^T*f)$ to Level-III. In Level-III the rotated input samples $f_\alpha = U*(C\times(U^T*f))$ are obtained as the output, by matrix multiplication between the Level-III input and the Hermite-Gaussian matrix U. If the input samples are complex (f = a + jb), the response of the U1 block must be calculated separately for the real and imaginary parts, so two U1 blocks are needed. Similarly, for any type of input samples f, two U2 blocks are required to process the real and imaginary Level-III inputs separately. For this reason, the blocks U1 and U2 are denoted as multi-blocks in Fig. 1. The data flow between these blocks is given in detail in Fig. 4. The time period between two successive input samples f is the same as the time period between two successive output results $f_\alpha$. The rest of this section describes each level of the proposed architecture in detail.

Level-I: In an N-point DFrFT, Level-I is partitioned into two parts. The first part calculates the eigenvalues for the given rotation angle α using a block named C, as shown in Fig. 2. This block receives an angle every N clock cycles and computes the corresponding N complex-conjugated eigenvalues. The results of block C for a given angle α are $e^{j0\alpha}, e^{j1\alpha}, e^{j2\alpha}, \ldots, e^{j(N-2)\alpha}, e^{jM\alpha}$, where M = N−1 for N odd and M = N for N even.

Fig. 2: Calculation of eigenvalues. (Labels recovered from the figure: enable, counter counting 0 to N−1 if N is odd and 0 to N−2, N if N is even, multiplier, registers R1 and R2, pipelined CORDIC calculating sine and cosine values, output registers, clocks Clk and Clkn, rotation angle α, real-part and imaginary-part outputs.)

The architecture for calculating the eigenvalues requires two clocks: clock1 (Clk), whose frequency equals the sampling frequency, and clock2 (Clkn), whose frequency is 1/N of that of clock1. With an active-high enable signal, the counter counts in the sequence …, 0, 1, 2, …, N−2, M, 0, 1, 2, …. The counter output is connected to a multiplier that takes the rotation angle as its other input through a register R1 clocked by clock2. The multiplier results 0, α, 2α, …, (N−2)α, Mα (M = N−1 for N odd, M = N for N even) are fed to the pipelined CORDIC (COordinate Rotation DIgital Computer) through another register R2. The CORDIC [15] calculates the cosine and sine of its input angles, which are the real and imaginary parts of the complex-conjugated eigenvalues for the given rotation angle. The real and imaginary parts of the computed results pass to the real-part and imaginary-part output ports, respectively, through a set of registers, as shown in Fig. 2. The need for these registers is explained at the end of the Level-I description.

The block U1, which forms the second part of Level-I, multiplies the input values f by the matrix $U^T$. This part consists of a mod-N counter, N ROMs with N address locations each, N multipliers, N accumulators, one N-to-1 multiplexer and a set of buffers. The data flow in this part is shown in Fig. 3. Like block C, block U1 operates with two clocks, clock1 (Clk) and clock2 (Clkn). The N rows of the matrix $U^T$ are stored in the N ROMs. The arrangement of the rows of $U^T$ in the ROMs is shown in Table-I.
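The three-level computation $f_\alpha = U*(C\times(U^T*f))$, together with the counter sequence 0, 1, …, N−2, M that indexes the eigenvalues, can be sketched in software as follows. This is only a behavioural illustration under stated assumptions: the function names are hypothetical, the eigenvalues are generated directly as $e^{-jk\alpha}$ rather than as conjugated CORDIC outputs, and `U_STANDIN` is a normalized 4×4 Hadamard matrix used purely as an orthogonal placeholder, not the discrete Hermite-Gaussian matrix of equation (5).

```python
import cmath

def eigen_indices(n):
    """Counter sequence 0, 1, ..., N-2, M with M = N-1 (N odd) or M = N (N even)."""
    m = n - 1 if n % 2 else n
    return list(range(n - 1)) + [m]

def eigenvalues(alpha, n):
    """Column matrix C: e^{-j k alpha} for each index k of the counter sequence."""
    return [cmath.exp(-1j * k * alpha) for k in eigen_indices(n)]

def matvec(a, x):
    """Plain matrix-vector product."""
    return [sum(a_rc * x_c for a_rc, x_c in zip(row, x)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dfrft_three_level(f, alpha, u):
    """Evaluate U * (C x (U^T * f)) level by level."""
    g = matvec(transpose(u), f)                 # Level-I, block U1: U^T * f
    c = eigenvalues(alpha, len(f))              # Level-I, block C
    h = [c_k * g_k for c_k, g_k in zip(c, g)]   # Level-II: array multiplication
    return matvec(u, h)                         # Level-III, block U2: U * (...)

# ASSUMPTION: orthogonal stand-in matrix, NOT the Hermite-Gaussian matrix U.
U_STANDIN = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1), (1, -1, 1, -1), (1, 1, -1, -1), (1, -1, -1, 1))]
```

With any orthogonal matrix in place of U, the sketch returns the input unchanged for α = 0 and is index additive (applying α and then β equals applying α + β), which mirrors the properties listed for the eigenvector decomposition type of DFrFT.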

TABLE-I ARRANGEMENT OF THE ELEMENTS OF MATRIX $U^T$ IN ROMS

| Address location | ROM 1 | ROM 2 | … | ROM N−1 | ROM N |
|---|---|---|---|---|---|
| 0 | $U^T_{R+1,1}$ | $U^T_{R+2,1}$ | … | $U^T_{R-1,1}$ | $U^T_{R,1}$ |
| 1 | $U^T_{R+1,2}$ | $U^T_{R+2,2}$ | … | $U^T_{R-1,2}$ | $U^T_{R,2}$ |
| ⋮ | ⋮ | ⋮ | | ⋮ | ⋮ |
| N−1 | $U^T_{R+1,N}$ | $U^T_{R+2,N}$ | … | $U^T_{R-1,N}$ | $U^T_{R,N}$ |

$U^T_{k,l}$ denotes the element in the kth row and lth column of the matrix $U^T$; R is the value of N/2 rounded towards zero.
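The ROM arrangement of Table-I and the sample-serial multiply-accumulate operation of block U1 can be modelled with the short sketch below. The function names are hypothetical and the model ignores clocks, registers and the output multiplexer; it only illustrates which row of $U^T$ each ROM holds and what each accumulator contains after N samples.

```python
def rom_row_index(rom, n):
    """1-based row of U^T stored in ROM `rom` (1..N) according to Table-I.

    ROM 1 holds row R+1, ROM 2 row R+2, ..., ROM N-1 row R-1, ROM N row R,
    where R is N/2 rounded towards zero and rows wrap around after row N.
    """
    r = n // 2
    return (r + rom - 1) % n + 1

def build_roms(ut):
    """ROM contents: address a of each ROM holds column a+1 of its row of U^T."""
    n = len(ut)
    return [list(ut[rom_row_index(rom, n) - 1]) for rom in range(1, n + 1)]

def u1_accumulate(roms, samples):
    """One input sample per clock1 cycle; the mod-N counter is the ROM address."""
    n = len(roms)
    acc = [0] * n
    for addr, f in enumerate(samples):
        for i in range(n):
            acc[i] += roms[i][addr] * f   # multiplier i feeds accumulator i
    return acc                            # accumulator i = (row in ROM i+1) . f
```

For N = 4 this mapping places rows 3, 4, 1 and 2 of $U^T$ in ROM 1 to ROM 4, and after four samples each accumulator holds the dot product of its stored row with the input vector.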

The ROMs are accessed by a ring counter with an active-high enable signal, as shown in Fig. 3.

Fig. 3: The data flow for part II of Level-I. (Labels recovered from the figure: enable, counter, ROM 1 to ROM N with addresses 0 to N−1, input f, N multipliers, registers R21 to R2N and R31 to R3N, accumulators 1 to N with clear, N-to-1 MUX, register R(Ci+1), output f(n)*$U^T$, clocks Clk and Clkn.)

The data at the addressed location of each of the N ROMs are passed to the N multipliers, which take the sampled input f[n] as their other input. In every clock1 cycle, the N multipliers multiply the sampled input by the output values of the corresponding N ROMs and forward the results to their N accumulators through registers, as shown in Fig. 3. Each of the N accumulators adds its input to its current output value in every clock1 cycle. Once the N accumulators have added their N sets of inputs, they send the resulting data to the next stage and are cleared so that they can accumulate N fresh sets of inputs. The N accumulator outputs pass through the N-to-1 multiplexer to a set of registers that are clocked by clock1 (Clk). The multiplexer select line is connected to the counter output. The multiplexer inputs are connected to the N accumulator outputs in such a way that the 1st, 2nd, 3rd, …, Nth valid outputs of the N-to-1 multiplexer are the 1st, 2nd, 3rd, …, Nth accumulator outputs.

Level-II performs a mathematical operation on the outputs of block C and block U1, so the computed results of both blocks must be forwarded to Level-II at the same time. However, the latency of block C varies with the number of pipeline stages used in the CORDIC, and the latency of block U1 depends on the value of N. To maintain the same latency for both blocks, a set of registers must be inserted in either block C or block U1. The number of registers depends on the values of N and $C_i$, where $C_i$ is the number of pipeline stages used in the CORDIC. If N > $C_i$−1, then N+1−$C_i$ registers must be added in block C, no register set is needed in block U1, and the latency is L = N+3. If N < $C_i$−1, then $C_i$−(N+1) registers must be added at the output of the multiplexer in block U1, no register set is needed in block C, and the latency is L = $C_i$+2. If N = $C_i$−1, then the latency of both blocks is the same, no register set is required in either block, and the latency is L = N+3 = $C_i$+2. The data flow from this block to the next blocks is shown in Fig. 4.

Fig. 4: Data flow diagram of DFrFT. (Labels recovered from the figure: enable, rotation angle α, block C, two U1 blocks for Real(f) and Imag.(f), block E with inputs Real1, Imag.1, Real2, Imag.2 and outputs r1 to rN and i1 to iN, U2 for the real part giving Real($f_\alpha$[n]), U2 for the imaginary part giving Imag.($f_\alpha$[n]), clocks Clk and Clkn.)
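The register-balancing rule between blocks C and U1 can be written as a small helper for checking a given configuration. This is a sketch of the stated rule only; the function name and the return layout are assumptions made for illustration.

```python
def balance_latency(n, ci):
    """Return (registers added to C, registers added to U1, Level-I latency L).

    n  : transform order N
    ci : number of pipeline stages used in the CORDIC (Ci)
    """
    if n > ci - 1:
        return n + 1 - ci, 0, n + 3      # pad block C
    if n < ci - 1:
        return 0, ci - (n + 1), ci + 2   # pad the multiplexer output of U1
    return 0, 0, n + 3                   # N = Ci - 1: already balanced
```

In all three cases the returned latency equals max(N+3, Ci+2), so the rule simply pads the faster of the two blocks until both reach the slower block's latency.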

Level-II: This level has a complex multiplier followed by two serial-in parallel-out shift registers and a set of 2N registers. It receives the real and imaginary parts of the complex-conjugated eigenvalues from block C through its Real2 (R2) and Imag.2 (I2) ports, respectively, and the response of block U1 to the input samples f through its other two input ports, Real1 (R1) and Imag.1 (I1). The block diagram is shown in Fig. 5.
Fig. 5: Complex multiplier with shifting operation. (Labels recovered from the figure: inputs Real1 (R1), Real2 (R2), Imag.2 (I2) and Imag.1 (I1), four multipliers with registers R1 to R5, outputs Out1 and Out2 each feeding a serial-in parallel-out shift register clocked by Clk, 2N registers clocked by Clkn, real-part and imaginary-part outputs.)
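The arithmetic of the Level-II complex multiplier, whose two outputs are (R1×R2) + (I1×I2) and (R2×I1) − (R1×I2), amounts to multiplying the U1 result R1 + jI1 by the complex conjugate of the value R2 + jI2 supplied by block C. A minimal sketch, with a hypothetical function name, is:

```python
def conjugate_multiply(r1, i1, r2, i2):
    """Compute (r1 + j*i1) * conj(r2 + j*i2) with four real multiplications.

    r1, i1 : real and imaginary parts of the block U1 result
    r2, i2 : real and imaginary parts of the value received from block C
    """
    out1 = (r1 * r2) + (i1 * i2)   # real part
    out2 = (r2 * i1) - (r1 * i2)   # imaginary part
    return out1, out2
```

Because block C delivers cos(kα) + j sin(kα), conjugating inside the multiplier is equivalent to multiplying the U1 result by $e^{-jk\alpha}$.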

The complex multiplier of this block differs from an ordinary complex multiplier. It multiplies the eigenvalues with the results of block U1, taking the complex conjugates of the eigenvalues and the results of block U1 as inputs. In every clock cycle the complex multiplier multiplies a new pair of complex numbers. Its outputs out1 and out2 deliver the results (R1×R2) + (I1×I2) and (R2×I1) − (R1×I2), respectively. These two outputs, one the real part and the other the imaginary part, are connected to two serial-in parallel-out shift registers. Each shift register requires N−1 registers. Every N−1 clock cycles the complex multiplier passes N−1 results to each serial-in parallel-out shift register. The shift registers are followed by a set of 2N registers. The first and (N+1)th registers are connected to the real and imaginary outputs of the complex multiplier, respectively. The remaining registers, 2 to N and N+2 to 2N, are connected to the N−1 outputs of the first shift register (corresponding to out1) and the N−1 outputs of the second shift register (corresponding to out2), respectively, as shown in Fig. 5. Unlike the shift registers, which are clocked by clock1, these registers are clocked by clock2.

Level-III: This level performs another matrix multiplication on the outputs of Level-II. The signal flow graph is shown in Fig. 6.
Fig. 6: Signal flow graph for Level-III. (Labels recovered from the figure: count input, Input 1 to Input N, ROM addresses 0 to N−1, Data 1 to Data N, N multipliers, block U2.)

This level has N ROMs; each ROM stores one column of the N×N matrix U. Because all the ROMs are accessed using the counter output of block U1, the matrix elements are arranged in the ROMs as follows to keep the ROM outputs synchronous with the multiplier inputs. Address 0 of all the ROMs holds the rth row of matrix U, where r is the remainder of (N+L)/N. Address 1 of the N ROMs stores the next row of the matrix, and the remaining locations of the N ROMs follow the same sequence. Following this sequence, the (r−1)th memory location stores the first row of the matrix. When the counter counts k, the data in the kth memory locations of the N ROMs are multiplied by the output values of Level-II, as shown in Fig. 6. The N resulting multiplier outputs are added using N−1 adders and sent out as the input samples rotated in the time-frequency plane by the given angle α.

IV. RESULT AND DISCUSSION

The proposed architecture discussed in the previous section has been designed using Verilog HDL for order N equal to four. The design has been simulated using the Xilinx simulator with random input samples f(n) = [11+3i, 9+2i, 7+4i, 8+2i] as the test vector. For simplicity, and to make the outputs of the design easy to interpret, integer input values have been chosen and represented with five bits (one bit for the sign and four bits for the integer value). The internal precision of each block has been chosen to minimize truncation error. Finally, the outputs are given in a 16-bit format (one bit for the sign, four bits for the integer part and eleven bits for the fractional part). Similarly, for the fractional value α, a format with binary weights $[-\pi,\ \pi/2^{1},\ \pi/2^{2},\ \pi/2^{3},\ \ldots,\ \pi/2^{b-1}]$ has been chosen. In this case b = 16. The hardware complexity of the proposed design for an Nth-order DFrFT is summarized in Table-II. The design is based on a pipelined approach; hence it requires a latency period of L+N+1, where L is the latency of the CORDIC.

TABLE-II HARDWARE REQUIREMENT FOR N-POINT DFRFT

| Component name | Number of components |
|---|---|
| N×16N-bit ROM | 2 |
| Multipliers | 4N+5 |
| Adders/Subtractors | 4N + adders in CORDIC |
| N-to-1 multiplexers | 2 |
| Counters | 2 |
| Registers | 10N+6+$C_i$+2×(\|N+1−$C_i$\|) |
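The component counts of Table-II are simple functions of N, the number of CORDIC pipeline stages $C_i$ and the number of adders inside the CORDIC, so they can be evaluated with a short helper. The function name and dictionary keys below are illustrative assumptions; the values Ci = 16 and 48 CORDIC adders used in the usage note are also assumptions, chosen only because they reproduce the 1024-point figures listed for the proposed architecture in Table-V.

```python
def component_counts(n, ci, cordic_adders):
    """Evaluate the Table-II expressions for an N-point DFrFT.

    n             : transform order N
    ci            : number of CORDIC pipeline stages (Ci)
    cordic_adders : adders/subtractors inside the CORDIC (design dependent)
    """
    return {
        "roms": 2,
        "multipliers": 4 * n + 5,
        "adders_subtractors": 4 * n + cordic_adders,
        "multiplexers": 2,
        "counters": 2,
        "registers": 10 * n + 6 + ci + 2 * abs(n + 1 - ci),
    }
```

For example, `component_counts(1024, 16, 48)` gives 4101 multipliers, 4144 adders/subtractors and 12280 registers, and `component_counts(4, ci, cordic_adders)` gives 21 multipliers for any `ci` and `cordic_adders`, matching the multiplier count reported in Table-IV.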

The simulation output of ISE 10.1i is presented in Table-III, which shows that the Verilog HDL simulation results of the proposed design are close to the MATLAB simulation outputs. The simulated output with timing is shown in Fig. 7. It shows that the proposed architecture has a latency of 19 clock cycles (14 clock cycles for Level-I and 5 clock cycles for Level-II and Level-III together, as discussed in the previous section). Finally, the proposed design has been synthesized using the Xilinx XST tool, targeting an FPGA device (XLV5LX110T) [15]. The synthesis results obtained for the hardware are presented in Table-IV.

Fig.7: The Simulation Results of proposed DFRFT architecture using ‘Xilinx ISE’ Simulator

TABLE-III COMPARISON OF MATLAB AND XILINX-ISE SIMULATION RESULTS

| MATLAB simulation results, decimal values | Xilinx-ISE simulation results of proposed architecture, decimal (hexadecimal) values |
|---|---|
| 10.5406+4.2159i | 10.7929+3.6176i (5658+1CF1i) |
| 9.0500+2.2435i | 9.0234+2.1074i (4830+10DCi) |
| 7.0371+3.6100i | 7.0151+3.8052i (381F+1E71i) |
| 8.0513+2.1945i | 8.0234+2.1074i (4030+10DCi) |
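The hexadecimal words in Table-III can be related to their decimal values through the 16-bit output format with eleven fractional bits. The sketch below decodes one word; it assumes a two's-complement sign representation, which the text does not state explicitly, and the function name is hypothetical.

```python
def q4_11_to_float(word):
    """Decode a 16-bit word with 1 sign, 4 integer and 11 fractional bits.

    ASSUMPTION: negative values are stored in two's complement.
    """
    word &= 0xFFFF
    if word & 0x8000:          # sign bit set
        word -= 0x10000
    return word / 2048.0       # 2**11 fractional steps per unit
```

For instance, `q4_11_to_float(0x5658)` evaluates to 10.79296875 and `q4_11_to_float(0x1CF1)` to about 3.61768, which agree with the first Xilinx-ISE entry 10.7929+3.6176i to within 1e-4.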

TABLE-IV HDL SYNTHESIS REPORT - MACRO STATISTICS

| Component name | Number of components |
|---|---|
| 4×64-bit ROM | 2 |
| Multipliers | 21 |
| Adders/Subtractors | 56 |
| 4-to-1 multiplexers | 2 |
| Counters | 2 |
| Registers | 71 |
| Accumulators | 9 |

The synthesis report in this table shows that the synthesized hardware requirement is approximately the same as the theoretical requirement. The timing report of this implementation shows that the proposed design can operate at a maximum frequency of 217 MHz. The proposed architecture has also been compared with the architecture presented in [12] for N = 1024. The comparison of hardware and timing is given in Table-V.

TABLE-V COMPARISON OF PROPOSED ARCHITECTURE WITH [12] FOR 1024-POINT DFRFT

| | Architecture in [12] | Proposed architecture |
|---|---|---|
| Hardware requirement (number of components) | | |
| Multipliers | 1048576 | 4101 |
| Adders/Subtractors | 1048576 | 4144 |
| Registers | 5242880 | 12280 |
| Multiplexers | 3072 (2:1 Mux), 3072 (4:1 Mux) | 2 (1024:1 Mux) |
| Counters | Not mentioned | 2 |
| Timing details | | |
| Maximum speed | 99.58 MHz | 217.39 MHz |
| Sampling frequency | 33.00 MHz | 217.39 MHz |

This shows that the proposed design is better than the architecture presented in [12] in terms of both hardware complexity and timing.

V. CONCLUSION

In this paper, a new hardware architecture for computing the DFrFT has been proposed. The architecture has been described using Verilog HDL, synthesized and implemented on the targeted FPGA device (XLV5LX110T). The simulation results have been verified with MATLAB, and the implementation results have been compared with the theoretical results and with the existing architecture presented in [12]. The implementation results show that the proposed design is suitable for most signal processing, image processing and communication systems. The proposed architecture and its implementation are fixed in terms of the transform length N, which restricts the design to specific applications. A flexible architecture is required to meet the demands of all applications. In this context, the authors are working on a unified architecture suitable for all applications.

REFERENCES

[1] L. B. Almeida, "The Fractional Fourier Transform and Time-Frequency Representations", IEEE Trans. on Signal Process., vol. 42, pp. 3084-3090, November 1994.
[2] V. Namias, "The Fractional Order Fourier Transform and its Application to Quantum Mechanics", J. Inst. Math. Appl., vol. 25, pp. 241-265, August 1980.
[3] S. C. Pei, C. C. Tseng, M. H. Yeh, and J. J. Shyu, "Discrete fractional Hartley and Fourier transforms", IEEE Trans. Circuits Syst. II, vol. 45, pp. 665-675, 1998.
[4] H. M. Ozaktas, B. Barshan, D. Mendlovic, L. Onural, "Convolution, filtering, and multiplexing in fractional Fourier domains and their relation to chirp and wavelet transform", J. Opt. Soc. Am. A, vol. 11, pp. 547-559, February 1994.
[5] N. Zhou, T. Dong, "Optical image encryption scheme based on multiple parameter random fractional Fourier transform", 2009 Second Int. Symposium on Electronic Commerce and Security, pp. 48-51, 2009.
[6] Y. Zhang, Q. Zhang, S. Wu, "Biomedical signal detection based on Fractional Fourier Transform", IEEE, ITAB 2008, pp. 349-352, May 2008.
[7] W. Pan, K. Qin, Y. Chen, "An Adaptable-Multilayer Fractional Fourier Transform Approach for Image Registration", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 31, March 2009.
[8] R. Iwai, H. Yoshimura, "Security of registration data of fingerprint image with a server by use of the fractional Fourier transform", IEEE, ICSP2008 Proceedings, pp. 2070-2073, 2008.
[9] WU. Hai-zhai, Tao Ran, "Broadband Beamforming of LFM signal based on Fractional Fourier Transform", ICSP2008 Proceedings, pp. 296-298, 2008.
[10] C. Candan, M. A. Kutay, H. M. Ozaktas, "The Discrete Fractional Fourier Transform", IEEE Trans. on Signal Process., vol. 48, pp. 1329-1337, May 2000.
[11] T. Ran, Z. Feng, W. Yue, "Research progress on discretization of fractional Fourier transform", Springer, Sci. China Ser. F-Inf. Sci., pp. 859-880, July 2008.
[12] P. Sinha, S. Sarkar, A. Sinha, D. Basu, "Architecture of a configurable Centered Discrete Fractional Fourier Transform Processor", IEEE Circuits and Systems, MWSCAS 2007, 50th Midwest Symposium, pp. 329-332, 2007.
[13] S. C. Pei, W. L. Hsue, J. J. Ding, "Discrete Fractional Fourier Transform Based on New Nearly Tridiagonal Commuting Matrices", IEEE Trans. on Signal Processing, vol. 54, pp. 3815-3828, October 2006.
[14] K. C. Ray and A. S. Dhar, "CORDIC-based unified VLSI architecture for implementing window functions for real time spectral analysis", IEE Proc.-Circuits Devices Syst., vol. 153, pp. 539-544, December 2006.
[15] Xilinx, "Virtex-5 FPGA User Guide", UG190 (v4.7), May 1, 2009.