FFT Implementation Using QCA: Muhammad Awais, Marco Vacca, Mariagrazia Graziano and Guido Masera

FFT Implementation using QCA
Muhammad Awais, Marco Vacca, Mariagrazia Graziano and Guido Masera

Politecnico di Torino, Department of Electronics and Telecommunication, Italy
Abstract: Quantum-dot Cellular Automata (QCA) is an emerging nanotechnology paradigm that is currently being investigated as a possible CMOS substitute. It promises higher speed, smaller area and lower power consumption than CMOS transistors. However, because QCA circuits are intrinsically pipelined, feedback signals cause serious throughput reductions. Consequently, to fully exploit the potential of this technology, circuit architectures must be designed to reduce or eliminate feedback. This work proposes, as a relevant design case, the QCA implementation of the Fast Fourier Transform (FFT) algorithm. A novel architecture for a partial parallel FFT processor is presented which not only reduces circuit complexity but also eliminates the need for feedback signals, thereby maximizing throughput. The proposed architecture is described using an accurate, layout-aware VHDL model, which is exploited in a hierarchical bottom-up approach to evaluate the logical behavior, area and power dissipation of the whole design. This approach considerably expands the field of application of QCA circuits.
Index Terms: QCA, FFT, Digital signal processing, VHDL.
I. INTRODUCTION
Quantum-dot Cellular Automata (QCA) [1] is an emerging nanotechnology that has gained significant research interest in recent years. Extremely small feature size, ultra-low power consumption and high clock frequency make QCA a potentially attractive solution for implementing processing architectures at the nanoscale. Two implementations of QCA are particularly appealing: molecular QCA [2], built using complex molecules with many oxidation-reduction centers, and magnetic QCA [3], based on single-domain nanomagnets with only two stable magnetization states. Unlike conventional logic circuits, which rely on current conduction, QCA operates through the Coulomb (molecular QCA) or magnetostatic (magnetic QCA) interaction that couples the state of one cell to the state of its neighbors. Many QCA circuits based on this general principle have been proposed in the literature. These include simple combinational blocks like multiplexers, more complex arithmetic circuits like adders, multipliers and dividers, and sequential circuits like latches and memories. Most of these circuits are designed with the QCADesigner tool [4], which allows physical placement of individual cells. A good summary of the state of the art in QCA design is given in [5]. However, only by tackling the design of complex applications can we pinpoint the real strengths and weaknesses of QCA technology as a possible CMOS substitute. Fast Fourier Transform (FFT) algorithms are one such application: they are among the most important tools in digital signal processing. Implementing an FFT algorithm in CMOS VLSI technology is a challenging task, mainly because of the computationally intensive nature of the algorithm and the real-time signal processing requirements. In low and medium throughput applications, CMOS-based implementations usually adopt partial parallel architectures with internal feedback to reduce FFT circuit complexity. Unfortunately, due to the inherent pipelined behavior of QCA technology (see Section II), the presence of feedback signals in the circuit leads to significant performance drops [6]. This problem seriously limits the field of application of QCA technology. In this work, as an absolute novelty in the literature, we propose a QCA architecture for the FFT processing core, which could also be extended to cover similar applications, such as wavelet filters [7]. An N-point FFT is implemented in a partial parallel, time-multiplexed way without using feedback signals, in order to achieve the highest throughput and fully exploit the potential of QCA. The circuit is described using a realistic layout-aware VHDL model, which allows verification of the logical behavior of the circuit and performance estimation for both magnetic and molecular QCA. Simulation results show that remarkable area savings and high throughput could be achieved with a molecular QCA implementation, while magnetic QCA is attractive for achieving low power.

II. QCA BACKGROUND
QCA is based upon the encoding of binary information in the charge configuration within quantum-dot cells. A typical ideal QCA cell can be viewed as a square in which a charge container, or quantum dot, is placed at each vertex. At ground state (equilibrium), the cell contains a few extra electrons that are confined in the cell but can quantum mechanically tunnel between the dots. Coulomb repulsion forces the electrons to occupy two dots on a diagonal. As there are two possible diagonals (polarizations), two ground states are possible and represent logic 0 and 1 (Fig. 1.A). The QCA wire, inverter and majority gate are the QCA primitives. The QCA wire is made by cascading base cells (Fig. 1.B). The majority gate is the fundamental logic gate in QCA design; with inputs A, B and C it realizes the logic function Y = M(A,B,C) = AB + AC + BC (Fig. 1.C). Logical AND and OR functions can be realized by permanently setting one input of the majority gate to 0 and 1 respectively. The inverter is built by exploiting the diagonal coupling among neighboring cells (Fig. 1.D).

Synchronization in QCA circuits is achieved with the help of adiabatic switching, which modulates the interdot tunneling barrier of the QCA cell. This is accomplished by using four distinct, periodic phases (0, π/2, π, 3π/2) of a reference clock signal, as shown in Fig. 1.E. A QCA circuit is partitioned into a number of clock zones, where adjacent clock zones have a π/2 phase shift between them and every fourth clock zone receives the same applied signal. The four phases are Relax, Switch, Hold and Release. During the Relax phase, there is no interdot barrier and the cell remains unpolarized. During the Switch phase, the interdot barrier is slowly raised and the cell attains a polarization under the influence of its neighbors. In the Hold phase, the barriers are high and the cell retains its polarity, acting as an input to the neighboring cells. Finally, in the Release phase, the barriers are lowered and the cell loses its polarity. This clocking mechanism is responsible for the inherent pipelined behavior of QCA and for multi-bit information transfer through signal latching. A signal is effectively latched when one clock zone goes into the Hold phase and acts as an input to the subsequent zone. This intrinsic pipelining leads to a loss of throughput when there is feedback in the circuit [6], since a new input can enter a loop only after the previous result has traveled the whole feedback path. The throughput can be N times lower, where N is the feedback length in terms of clock cycles. This is a well-known problem in CMOS, but in QCA technology it is amplified because the pipeline is much longer. It is therefore mandatory to design circuits so as to eliminate feedback.

Apart from the original proposal, two implementations of QCA are particularly appealing: magnetic QCA (MQCA) and molecular QCA. In magnetic QCA [3] the base cell is an asymmetric single-domain nanomagnet that can be magnetized in only two possible ways, encoded as the two logic states 0 and 1. Switching the cell from one state to the other requires a clock, which in the MQCA case is a strong magnetic field, normally generated by a current flowing through a wire placed under the magnets' plane. While the achievable clock frequency is low (100 MHz), this QCA implementation is interesting for its low power consumption and the possibility of merging logic and memory in the same device.

Molecular QCA is built using molecules with a few oxidation-reduction (redox) sites for charge localization and bridging ligands to provide tunneling among those sites. Redox sites act as quantum dots, able to encode and propagate information. Very small dimensions (1-2 nm [2]) and very high switching speeds (1 THz [2]) could be reached. For molecular QCA, the clock is generated by applying an electric field perpendicular to the molecular plane [8].

Figure 1: Quantum-dot Cellular Automata. A) QCA cells representing logic values 0 and 1. B) QCA wire. C) Majority voter. D) Inverter. E) Four-phase clocking scheme.

978-1-4673-1260-8/12/$31.00 ©2012 IEEE

III. FFT ALGORITHM
The Discrete Fourier Transform (DFT) of N complex data points x(n) is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi nk}{N}}, \qquad k = 0, 1, 2, \ldots, N-1 \tag{1}$$

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk} \tag{2}$$

where $W_N^{nk} = e^{-j\frac{2\pi nk}{N}}$ is called the twiddle factor. The direct implementation of the DFT has a complexity of $O(N^2)$. Using the FFT, the complexity can be reduced to $O(N \log_2 N)$. Based on the Cooley-Tukey algorithm [9], two strategies exist to compute the FFT: Decimation In Frequency (DIF) and Decimation In Time (DIT). Since the computational complexities of the two strategies are the same, we focus only on the DIT algorithm. When N is a power of 2, the N-point sequence can be decomposed into two (N/2)-point subsequences as follows:
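$$x_1(r) = x(2r), \qquad x_2(r) = x(2r+1), \qquad r = 0, 1, \ldots, \frac{N}{2}-1$$

Now the N-point DFT in equation (2) can be expressed as

$$X(k) = \sum_{n=0}^{N/2-1} x(2n)\, W_N^{2nk} + \sum_{n=0}^{N/2-1} x(2n+1)\, W_N^{(2n+1)k} \tag{3}$$

$$X(k) = \sum_{n=0}^{N/2-1} x_1(n)\, W_{N/2}^{nk} + W_N^{k} \sum_{n=0}^{N/2-1} x_2(n)\, W_{N/2}^{nk} \tag{4}$$

Considering the properties of the twiddle factor ($W_N^{2nk} = W_{N/2}^{nk}$ and $W_N^{k+N/2} = -W_N^{k}$), and writing $X_1(k)$ and $X_2(k)$ for the (N/2)-point DFTs of $x_1$ and $x_2$, we have

$$X(k) = X_1(k) + W_N^{k} X_2(k) \tag{5}$$

$$X\!\left(k + \frac{N}{2}\right) = X_1(k) - W_N^{k} X_2(k), \qquad k = 0, 1, \ldots, \frac{N}{2}-1 \tag{6}$$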
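As an illustration only, the decomposition in equations (5) and (6) can be checked numerically against the direct definition in equation (2). The sketch below is a plain software model, not the QCA hardware described in this paper; the function names `dft` and `fft_dit` are hypothetical and chosen for this example.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, Eq. (2): X(k) = sum_n x(n) * W_N^(n*k)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft_dit(x):
    """Recursive radix-2 DIT FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    X1 = fft_dit(x[0::2])  # DFT of even-indexed samples, x1(r) = x(2r)
    X2 = fft_dit(x[1::2])  # DFT of odd-indexed samples,  x2(r) = x(2r + 1)
    X = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * X2[k]  # W_N^k * X2(k)
        X[k] = X1[k] + t             # Eq. (5)
        X[k + N // 2] = X1[k] - t    # Eq. (6)
    return X

# Compare both computations on an arbitrary 8-point complex sequence.
x = [complex(n, -n) for n in range(8)]
max_err = max(abs(a - b) for a, b in zip(fft_dit(x), dft(x)))
print(max_err < 1e-9)
```

Each pass through the loop body is one radix-2 butterfly: one complex multiplication by the twiddle factor followed by one addition and one subtraction.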
For N = 2, the FFT reduces to a structure known as the radix-2 butterfly. Because N is a power of 2, the same computational procedure can be applied recursively until the N-point DFT is eventually evaluated as a collection of 2-point DFTs.

IV. PROPOSED FFT ARCHITECTURE
A fully parallel implementation of the N-point radix-2 DIT FFT can be represented by an $L \times M$ matrix of radix-2 butterflies, where $M = \log_2 N$ is the number of stages and each stage consists of $L = N/2$ radix-2 butterflies. Each butterfly involves one complex multiplication and two complex additions, requiring an overall $L \times M$ complex multiplications and $N \log_2 N$ complex additions for an N-point FFT. Because the hardware cost grows as $N \log_2 N$, this approach is not suitable for most standard applications, which require very large values of N. An alternative solution is the partial parallel approach, in which a subset of the butterflies is implemented in hardware and operates on the N data points in a time-multiplexed way. Each butterfly receives its inputs from memory or from the results of a previous calculation, fed back directly. A straightforward mapping of such an architecture onto QCA is not efficient because of the loop path delay arising from the inherent pipelined structure. Unfolding this loop is the main contribution and innovation of this work: we present a low-complexity partial parallel FFT architecture that uses only forward data flow, thereby maximizing the throughput.

Figure 2: Proposed FFT processor. A) Generalized architecture. B) 8-point DIT FFT data flow. C) Proposed 4-point parallel architecture to implement the 8-point DIT FFT. D) Proposed 2-point parallel architecture to implement the 8-point DIT FFT.

Figure 2.A shows a generalized architecture of the proposed QCA-based FFT processor. To implement an N-point DIT FFT using this architecture, P data points are accessed in parallel, where $P = N/2^i$ $(i = 1, 2, 3, \ldots)$. The P-parallel block is a $(P/2) \times \log_2 P$ matrix of radix-2 butterflies and implements a P-point DIT FFT in a fully parallel way. The outputs of the P-parallel block are connected to a cascade of $\log_2(N/P)$ partial stages (PS). Each PS consists of P/2 radix-2 butterflies whose inputs are the outputs of the previous stage, connected either directly or delayed by d clock cycles, where the delay is $d = 2^{n-1}$ and $n$ $(= 1, 2, \ldots, \log_2(N/P))$ is the index of the PS.

The proposed architecture can be further explained by taking as an example the case of the 8-point DIT FFT, whose data flow is shown in Fig. 2.B. For parallelism P = 4, the simplified architecture is shown in Fig. 2.C, where a 4-parallel block implements the 4-point DIT FFT and the number of partial stages is $\log_2(8/4) = 1$. The delay element d has a delay of 1 clock cycle. For parallelism P = 2, the proposed architecture is shown in Fig. 2.D. The P-parallel block consists of one radix-2 butterfly and the number of partial stages is $\log_2(8/2) = 2$, with clock cycle delays indicated as numbers on the delay elements. Figures 2.C and 2.D also show the values of the inputs, outputs and control signals with respect to time. The following notation is used: $[n]_j$ represents data point n of input frame j, where $n = 1, 2, \ldots, N$ and $j = 1, 2, \ldots$; $\{n\}_j$ represents the FFT of data point n of input frame j; and $(n)_j$ represents the intermediate value of data point n of input frame j. The lines denoted DB, DP and DM represent the propagation delay (in number of clock zones) of the radix-2 butterfly, the P-parallel block and the 2x1 multiplexer respectively.
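To make the sizing rules of the generalized architecture concrete, the following minimal sketch tabulates butterfly counts for a given N and P and compares them with the fully parallel case. The helper names `log2_exact` and `plan` are hypothetical. The delay rule used here, d = 2^(n-1) for partial stage n, is an assumption chosen so that the single partial stage of the P = 4 example receives the 1-cycle delay stated in the text; it should be checked against Fig. 2 before being relied upon.

```python
def log2_exact(v):
    """Return log2(v) for a positive power of two, else raise ValueError."""
    if v < 1 or v & (v - 1):
        raise ValueError(f"{v} is not a power of two")
    return v.bit_length() - 1

def plan(N, P):
    """Butterfly counts for an N-point FFT with parallelism P = N / 2**i, i >= 1."""
    m, p = log2_exact(N), log2_exact(P)
    if not 1 <= p < m:
        raise ValueError("P must satisfy 2 <= P < N")
    n_ps = m - p               # log2(N/P) partial stages
    block = (P // 2) * p       # (P/2) x log2(P) butterflies in the P-parallel block
    partial = n_ps * (P // 2)  # P/2 butterflies in each partial stage
    return {
        "partial_stages": n_ps,
        "butterflies": block + partial,
        "full_parallel_butterflies": (N // 2) * m,  # L x M = (N/2) * log2(N)
        # Assumed delay rule: d = 2**(n - 1) for partial stage n = 1..n_ps.
        "delays": [2 ** (n - 1) for n in range(1, n_ps + 1)],
    }

for N, P in [(8, 4), (8, 2), (32, 8)]:
    print(N, P, plan(N, P))
```

Under these rules, halving P halves the number of butterflies for a given N, which is in line with the roughly twofold area ratios between the two parallelism settings reported for each N in Table I.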
V. VHDL MODEL, LAYOUT AND RESULTS
In this work, we have used VHDL to model QCA circuits. The construction of the VHDL model is simple: starting from the real layout of the circuit, a register is used for each clock zone to model the propagation delay, while ideal wires and gates (majority voters and inverters) are used to model the computational part of the clock zone. Registers are connected to their specific clock signals according to the selected clock scheme. To estimate the cell count (molecular or magnetic) of the whole design, a hierarchical bottom-up approach is adopted in which the cell count of a particular block is calculated by adding the number of cells of its sub-blocks, multiplied by numeric constants that account for the overhead due to interconnections. This cell count is then used to estimate the total area of the circuit and the power dissipation due to cell switching. Further details of this model can be found in [6], [10], [11].

We designed the circuit using the QCADesigner [4] tool, which gives a better understanding of the layout structure in QCA technology. Figure 3 shows the layout of the radix-2 DIT butterfly. For the sake of simplicity, two bits are used to represent each input and output message. The design utilizes 2-bit pipelined array multipliers M(1-4) as proposed in [12] and 2-bit ripple carry adders SUB(1-3) and ADD(1-3) as proposed in [13]. The design has been optimized so that the number of wire crossings is minimal. The largest section of QCA cells switching at the same time is limited to 20. The butterfly layout was simulated exhaustively with the QCADesigner tool using the coherence vector based approximation. The full layout of the FFT and its simulation are not reported due to space limitations.

Figure 3: Two-bit radix-2 DIT butterfly layout using the QCADesigner tool.

Table I shows area and power calculations for the proposed FFT processor for magnetic and molecular QCA technologies. Each input/output message is represented in two's complement 8-bit fixed-point Q[1.7] notation with 1 integer and 7 fractional bits. The table also shows the latency of the proposed architecture, using 4-phase adiabatic clocking, for different values of N and P.
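As a small illustration of the Q[1.7] data format mentioned above, the sketch below converts real values to 8-bit two's complement codes with 7 fractional bits and back. The function names `to_q1_7` and `from_q1_7` are hypothetical, and the rounding and saturation behavior shown here are assumptions for this example; the paper does not state how the hardware handles them.

```python
def to_q1_7(value):
    """Quantize a real value to an 8-bit two's complement Q1.7 code (0..255)."""
    raw = round(value * 128)         # scale by 2**7 fractional steps
    raw = max(-128, min(127, raw))   # saturate to the signed 8-bit range (assumed)
    return raw & 0xFF                # two's complement bit pattern

def from_q1_7(code):
    """Convert an 8-bit Q1.7 code back to a real value in [-1, 1)."""
    raw = code - 256 if code & 0x80 else code  # undo two's complement
    return raw / 128

for v in (0.5, -0.5, -1.0, 0.7071):
    code = to_q1_7(v)
    print(f"{v:+.4f} -> 0b{code:08b} -> {from_q1_7(code):+.7f}")
```

With this format the representable range is -1 to 1 - 2^-7, so a value of exactly +1 cannot be stored and, under the saturation assumed here, maps to 127/128.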
Table I: Performance estimation of the proposed QCA-based FFT processor. Technology process (Tech.), molecular QCA technology (Mol.), magnetic QCA technology (Mag.), total number of data points (N), parallelism (P), frequency (F), area (A), power dissipation (P), latency in terms of clock cycles (Lat.).

Tech.          Mol.                            Mag.
N     P     F       A (µm²)   P (µW)       F         A (mm²)   P (mW)     Lat.
8     4     1 THz   0.087     0.464        100 MHz   0.0725    0.019      83
8     2     1 THz   0.042     0.224        100 MHz   0.035     0.009      78
16    8     1 THz   0.462     2.464        100 MHz   0.385     0.1        110
16    4     1 THz   0.231     1.232        100 MHz   0.19      0.05       106
32    16    1 THz   2.31      12.3         100 MHz   1.93      0.5        138
32    8     1 THz   1.155     6.16         100 MHz   0.96      0.25       130

The data in Table I show that the advantages of molecular QCA are a very small area and a high theoretical frequency, while magnetic QCA is particularly suited to low-power applications. The achievable throughput of the designed processing unit is one output sample per clock cycle. Thanks to the proposed design approach, scalable FFT architectures can be designed with no feedback loops. As a consequence, the low area and low power consumption typically offered by QCA technology can be achieved with no penalty in terms of speed.

VI. CONCLUSIONS
We have presented an innovative architecture for implementing the FFT algorithm using QCA technology. This architecture is based on a partial parallel solution that does not use feedback signals. In this way, the problem of throughput reduction in QCA can be solved and the potential of this technology can be revealed. The results provided show the correctness of this approach, with a large reduction in area and power consumption while maintaining the maximum throughput.

REFERENCES
[1] W. P. C.A. Lent, P.D. Tougaw and G. Bernstein, "Quantum cellular automata," Nanotechnology, vol. 4, no. 1, pp. 49-57, 1993.
[2] Y. Lu, M. Liu, and C. Lent, "Molecular electronics - from structure to circuit dynamics," in Nanotechnology, 2006. IEEE-NANO 2006. Sixth IEEE Conference on, vol. 1, June 2006, pp. 62-65.
[3] M. Niemier, G. Bernstein, G. Csaba, A. Dingler, X. Hu, S. Kurtz, S. Liu, J. Nahas, W. Porod, M. Siddiq, and E. Varga, "Nanomagnet logic: progress toward system-level integration," J. Phys.: Condens. Matter, vol. 23, p. 34, Nov. 2011.
[4] K. Walus, T. Dysart, G. Jullien, and R. Budiman, "QCADesigner: A rapid design and simulation tool for quantum-dot cellular automata," IEEE Transactions on Nanotechnology, vol. 3, no. 1, March 2004.
[5] J. Huang and F. Lombardi, Design and Test of Digital Circuits by Quantum-Dot Cellular Automata. Artech House Publishers, 2007.
[6] M. Vacca, M. Graziano, and M. Zamboni, "Asynchronous solutions for nano-magnetic logic circuits," ACM Journal on Emerging Technologies in Computing Systems, vol. 7, no. 4, December 2011.
[7] M. Martina and G. Masera, "Low-complexity, efficient 9/7 wavelet filters VLSI implementation," Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 53, no. 11, pp. 1289-1293, Nov. 2006.
[8] D. D. A. Pulimeno, M. Graziano and G. Piccinini, "Towards a molecular QCA wire: simulation of write-in and read-out systems," Solid-State Electronics, vol. 77, pp. 101-107, November 2012.
[9] J. W. Cooley and J. W. Tukey, "An algorithm for the machine computation of the complex Fourier series," Mathematics of Computation, vol. 19, pp. 297-301, 1965.
[10] M. Graziano, M. Vacca, D. Blua, and M. Zamboni, "Asynchrony in quantum-dot cellular automata nanocomputation: Elixir or poison?" IEEE Design & Test of Computers, 2011.
[11] M. Vacca, M. Graziano, and M. Zamboni, "Nanomagnetic logic microprocessor: Hierarchical power model," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. PP, no. 99, p. 1, 2012.
[12] I. Hanninen and J. Takala, "Pipelined array multiplier based on quantum-dot cellular automata," in Circuit Theory and Design, 2007. ECCTD 2007. 18th European Conference on, Aug. 2007, pp. 938-941.
[13] W. Wang, K. Walus, and G. Jullien, "Quantum-dot cellular automata adders," in Nanotechnology, 2003. IEEE-NANO 2003. 2003 Third IEEE Conference on, vol. 1, Aug. 2003, pp. 461-464 vol.2.
