
1148 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 27, NO. 5, MAY 2019

Optimum Circuits for Bit-Dimension Permutations


Mario Garrido, Member, IEEE, Jesús Grajal, Member, IEEE, and Oscar Gustafsson, Senior Member, IEEE

Abstract— In this paper, we present a systematic approach to


design hardware circuits for bit-dimension permutations. The
proposed approach is based on decomposing any bit-dimension
permutation into elementary bit-exchanges. Such decomposition
is proven to achieve the theoretical minimum number of delays
required for the permutation. This offers optimum solutions
for multiple well-known problems in the literature that make
use of bit-dimension permutations. This includes the design of
permutation circuits for the fast Fourier transform, bit reversal,
matrix transposition, stride permutations, and Viterbi decoders.
Index Terms— Bit-dimension permutation, bit reversal, data
management, fast Fourier transform (FFT), matrix transposition,
pipelined architecture, streaming data, Viterbi decoder.

Fig. 1. Classification of permutations.

I. INTRODUCTION

BIT-DIMENSION permutations [1] are permutations on $N = 2^n$ data defined by a permutation of $n$ bits that represent the index of the data in binary. Bit-dimension permutations are a wide category that includes, among other permutations, the perfect shuffle [2], matrix transposition [3], [4], stride permutations [5]–[8], and bit reversal [9]–[15], as shown in Fig. 1. Regarding their use, bit-dimension permutations are used in important signal processing algorithms such as the fast Fourier transform (FFT) [15]–[21] and Viterbi decoders [22].

One way to design digital circuits that carry out bit-dimension permutations is to use lifetime analysis and register allocation [23], [24]. This approach determines the content of the registers used for the permutation at each time instant, leading to efficient use of the registers.

More recent works use memories [3], [13], [25]–[31] or delays (buffers or registers) [3], [6]–[10] to carry out the permutations. The approaches based on memories consist of a memory bank in parallel and multiplexers at the input and output of the memories. The multiplexers decide to/from which memory data is written/read. The approaches based on delays include delays and multiplexers in series and in parallel.

At present, there exist optimum solutions for bit-dimension permutations in terms of the number of delays [23], [24]. However, the interconnection between registers is complex and requires a large number of wires and multiplexers. There also exist optimum solutions for specific bit-dimension permutations. For bit reversal, optimum circuits have been proposed for serial [9] and parallel data [10]. Likewise, there are circuits with the minimum number of delays for matrix transposition and other stride permutation [8]. However, there is no general solution in the literature that provides the minimum number of delays as well as a reduced number of multiplexers for any bit-dimension permutation.

In this paper, we present a systematic approach to design hardware circuits for bit-dimension permutations, different to the commonly used Kronecker products [8], [25]. The proposed approach leads to circuits with an optimum number of delays for any bit-dimension permutation. This has several implications. First, the proposed approach widens the scope with respect to previous papers that only focus on specific permutations such as bit reversal or matrix transposition. Second, it provides optimum solutions in terms of delays for a wide range of permutations. Furthermore, it reduces the number of multiplexers with respect to previous approaches based on delays. As a result, the proposed approach is a systematic and optimized solution for a large group of permutations.

This paper is organized as follows. Section II briefly reviews the concept of bit-dimension permutations. Section III explains how to model a continuous data flow, which is the basis of the proposed approach. Section IV presents the circuits for elementary bit-exchange (EBE), which are the basic circuits that we use to carry out bit-dimension permutations. Section V describes how to calculate the cost of a bit-dimension permutation. Section VI presents the theoretical minimum latency and number of delays for a bit-dimension permutation. Section VII shows how to derive optimum circuits for bit-dimension permutations. Section VIII compares the proposed approach to previous approaches in the literature. Section IX summarizes the main conclusions and the Appendix shows an example on how to use the proposed approach.

Manuscript received June 30, 2018; revised October 5, 2018 and November 23, 2018; accepted December 25, 2018. Date of publication February 4, 2019; date of current version April 24, 2019. This work was supported in part by the Swedish ELLIIT Program, in part by the FPU Fellowship AP2005-0544 of the Spanish Ministry of Education, in part by the Spanish National Research and Development Program under Project TEC2014-53815-R, and in part by the Madrid Regional Government under Project S2013/ICE-3000 (SPADERADAR-CM). (Corresponding author: Mario Garrido.)
M. Garrido and O. Gustafsson are with the Department of Electrical Engineering, Linköping University, 581 83 Linköping, Sweden (e-mail: mario.garrido.galvez@liu.se; oscar.gustafsson@liu.se).
J. Grajal is with the Department of Signal, Systems, and Radiocommunications, Universidad Politécnica de Madrid, 28040 Madrid, Spain (e-mail: jesus.grajal@upm.es).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2019.2892322
1063-8210 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Authorized licensed use limited to: INDIAN INSTITUTE OF INFORMATION TECHNOLOGY. Downloaded on September 07,2020 at 15:24:47 UTC from IEEE Xplore. Restrictions apply.

II. BIT-DIMENSION PERMUTATIONS

Let us consider a set of $N = 2^n$ data, $n \in \mathbb{N}$, in an $n$-dimensional space $x_{n-1} x_{n-2} \ldots x_0$, where $x_i \in \{0, 1\}$. In this context, a bit-dimension permutation, $\sigma$, defines a reordering of the data according to a permutation of the coordinates in the space [1]. This allows for defining the permutation on a set of $n$ elements instead of defining it for $2^n$ values, which is, most of the times, mathematically inaccessible [32].

In general, a bit-dimension permutation is a permutation

$$\sigma(u_{n-1} u_{n-2} \ldots u_0) = u_{\sigma(n-1)} u_{\sigma(n-2)} \ldots u_{\sigma(0)} \tag{1}$$

which transforms a point in the space $u_{n-1} u_{n-2} \ldots u_0$ into a new point $u_{\sigma(n-1)} u_{\sigma(n-2)} \ldots u_{\sigma(0)}$ whose coordinates are a permutation of the coordinates of the original point. Before the permutation, $x_i = u_i$ and, after the permutation, $x_i = u_{\sigma(i)}$. Thus, $\sigma(i)$ defines the element $u_{\sigma(i)}$ that is moved to $x_i$. Likewise, the inverse permutation $\sigma^{-1}(i)$ indicates that the element in $x_i$ is moved to $x_{\sigma^{-1}(i)}$. Finally, it is fulfilled that $\sigma \circ \sigma^{-1} = \sigma^{-1} \circ \sigma = \mathrm{Id}$.

A. Types of Bit-Dimension Permutations

A bit-dimension permutation that only involves two dimensions,$^1$ $x_j$ and $x_k$, and exchanges their coordinates is called EBE [32]. This EBE can be represented as $\sigma : x_j \leftrightarrow x_k$ [9]. Throughout this paper, we also use the notation $(j\ k)$ to represent this EBE.

$^1$ In this paper, the word “dimension” refers to a direction in which we can move in the space.

A perfect shuffle [2] is a circular permutation of one bit to the left

$$\sigma_{PS}(u_{n-1} u_{n-2} \ldots u_1 u_0) = u_{n-2} \ldots u_1 u_0 u_{n-1}. \tag{2}$$

Likewise, a perfect unshuffle is a circular permutation of one bit to the right

$$\sigma_{PU}(u_{n-1} u_{n-2} \ldots u_1 u_0) = u_0 u_{n-1} u_{n-2} \ldots u_1. \tag{3}$$

A stride-by-$2^s$ permutation [5], [7] is a circular permutation of $s$ bits to the left, and it can be expressed as a composition of $s$ perfect shuffles

$$\sigma_{S2^s} = \underbrace{\sigma_{PS} \circ \cdots \circ \sigma_{PS}}_{s\ \text{times}} = (\sigma_{PS})^s. \tag{4}$$

III. MODELING A HARDWARE DATA FLOW

In this section, we propose a new model to describe a hardware data flow. The model considers a continuous data flow of $N = 2^n$ data in an $n$-dimensional space $x_{n-1} x_{n-2} \ldots x_0$.

A. Serial and Parallel Dimensions

In a hardware circuit, data flows in series and/or in parallel. Data flowing in series are provided to the same terminal at different clock cycles. Data flowing in parallel are provided at the same time to different terminals.

Fig. 2. Definition of serial and parallel dimensions in a data flow.

Fig. 2 shows the definition of serial and parallel dimensions in a data flow. As a convention, data flow from left to right, $x_0$ to $x_{p-1}$ are parallel dimensions and $x_p$ to $x_{n-1}$ are serial ones. This means that there are $p$ parallel dimensions and $n - p$ serial dimensions. This also means that the data flow is modeled as a rectangle of $2^p$ data in parallel times $2^{n-p}$ in series.

B. Position, Time, and Terminal

In the data flow, we can define the position occupied by each datum. As the number of data is $N = 2^n$, the positions are numbered from 0 to $2^n - 1$ according to

$$P = \sum_{i=0}^{n-1} x_i 2^i. \tag{5}$$

In other words, we can say that $P \equiv x_{n-1} x_{n-2} \ldots x_0$, where ($\equiv$) relates the decimal and binary representations of a number.

In the data flow, we can define the time of arrival and the terminal of the datum in any position $P$. The time of arrival is calculated as

$$t(P) = \sum_{i=p}^{n-1} x_i 2^{i-p} \tag{6}$$

and the input terminal is

$$T(P) = \sum_{i=0}^{p-1} x_i 2^i. \tag{7}$$

Note that $t(P)$ is the time of arrival relative to the arrival of the first sample at a given point of the circuit. Therefore, $t(P) = 0$ means that the sample in position $P$ is the first one to arrive at that point of the circuit.

Note that the input terminal is only determined by the parallel dimensions, whereas the time of arrival only depends on the serial ones. In addition, there exists a total of $2^p$ terminals, which are numbered from $T = 0$ to $T = 2^p - 1$ according to (7), and all the data arrive in $2^{n-p}$ clock cycles, from $t = 0$ to $t = 2^{n-p} - 1$ according to (6).

Finally, the vertical bar (|) is used throughout this paper to separate the serial and parallel dimensions. According to this, we can represent the position as

$$P = t \mid T. \tag{8}$$

Example: Fig. 3 shows a data flow with three dimensions. One of them is parallel and two are serial. The position is indicated in parenthesis and numbered from 0 to 7. The time and terminal are also indicated in this figure. For instance, position $P = 5$ corresponds to $x_2 x_1 x_0 = 101$. As $p = 1$, the time of arrival is $t(P) = 2 \equiv 10 = x_2 x_1 = x_{n-1} \ldots x_p$. Likewise, the terminal is $T(P) = 1 \equiv 1 = x_0 = x_{p-1} \ldots x_0$.
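The relations in (5)–(8) amount to splitting the binary representation of the position at bit $p$. The following is a minimal Python sketch of that split, assuming positions are plain non-negative integers; the function names `time_and_terminal` and `position_from` are illustrative and do not come from the paper.

```python
def time_and_terminal(position: int, p: int) -> tuple[int, int]:
    """Split a position P into (t, T), following (6) and (7)."""
    t = position >> p                       # serial bits x_{n-1} ... x_p
    terminal = position & ((1 << p) - 1)    # parallel bits x_{p-1} ... x_0
    return t, terminal


def position_from(t: int, terminal: int, p: int) -> int:
    """Rebuild the position P = t | T, as in (8)."""
    return (t << p) | terminal


# Example from the text: P = 5 (binary 101) with p = 1 parallel dimension.
print(time_and_terminal(5, 1))   # (2, 1)
```

With $p = 0$ the flow is fully serial, so the terminal is always 0 and the time of arrival equals the position.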


Fig. 3. Data flow with one parallel and two serial dimensions. Position is shown in parenthesis and is related to the time ($t$) and the terminal ($T$). The data indexes are shown by the boxed numbers without parenthesis. This data flow is defined by $P \equiv b_0 b_1 | b_2$.

C. Data Flow With Indexed Data

Signal processing algorithms define mathematical operations on indexed data. In our approach, we define the index of the data as $I \equiv b_{n-1} b_{n-2} \ldots b_0$ or, equivalently

$$I = \sum_{i=0}^{n-1} b_i 2^i. \tag{9}$$

Thus, $I$ represents the decimal value of the index and $b_i$ are the bits of its binary representation.

In a data flow with indexed data, the position is defined as a function of $b_i$. This allows to assign each indexed data to a position in the data flow.

Example: In the data flow shown in Fig. 3, the data indexes are shown by the boxed numbers without parenthesis and the positions are in parenthesis. The definition $P \equiv b_0 b_1 | b_2$ allows to know the position of each index in the data flow. For instance, $I = 6 \equiv 110 = b_2 b_1 b_0$ is in position $P \equiv b_0 b_1 | b_2 = 01|1 \equiv 3$. This also holds for any other index in the data flow.

D. Permuting a Continuous Data Flow

A shuffling circuit is represented by a function $\sigma$. The permutation that a circuit carries out is defined as $\sigma(u)$ and can be applied to a data flow with or without indexed data. If the input order is $P_0$ and the output order is $P_1$, then

$$\sigma(P_0) = P_1. \tag{10}$$

Example: $\sigma(u_2 u_1 | u_0) = u_2 u_0 | u_1$ defines the permutation of a shuffling circuit. When applying it to the input order $P_0 \equiv b_0 b_1 | b_2$, we obtain the output order $P_1 \equiv b_0 b_2 | b_1$.

IV. HARDWARE CIRCUITS FOR ELEMENTARY BIT-EXCHANGE

Any bit-dimension permutation can be decomposed into a series of EBEs [32]. This principle is followed in this paper to design circuits for bit-dimension permutation. This section describes the hardware circuits used to calculate EBEs. Sections V–VII explain how to use the EBEs to create circuits for bit-dimension permutations.

A. General Considerations

An EBE exchanges the coordinates of two dimensions, $x_j$ and $x_k$. Without loss of generality, let us assume that $j > k$.

If $\sigma$ is an EBE that calculates $\sigma(P_0) = P_1$, then $P_0$ and $P_1$ only differ in the coordinates that are exchanged, that is

$$\begin{aligned} P_0 &\equiv u_{n-1} \ldots u_{j+1}\, u_j\, u_{j-1} \ldots u_{k+1}\, u_k\, u_{k-1} \ldots u_0 \\ P_1 &\equiv u_{n-1} \ldots u_{j+1}\, u_k\, u_{j-1} \ldots u_{k+1}\, u_j\, u_{k-1} \ldots u_0. \end{aligned} \tag{11}$$

As $x_i \in \{0, 1\}$, samples for which $x_j = x_k$ will remain in the same position, as $P_1 = P_0$. Conversely, if $x_j \neq x_k$, the input position corresponds to one of these options

$$\begin{aligned} P_{0A} &\equiv u_{n-1} \ldots u_{j+1}\, 0\, u_{j-1} \ldots u_{k+1}\, 1\, u_{k-1} \ldots u_0 \\ P_{0B} &\equiv u_{n-1} \ldots u_{j+1}\, 1\, u_{j-1} \ldots u_{k+1}\, 0\, u_{k-1} \ldots u_0. \end{aligned} \tag{12}$$

If the initial position is $P_0 = P_{0A}$, the EBE moves the sample to $P_1 = P_{0B}$. If $P_0 = P_{0B}$, the output position is $P_1 = P_{0A}$. Therefore, pairs of samples whose position only differ in $x_j$ and $x_k$ are swapped. As a result, an EBE changes the position of half of the $N$ samples, i.e., those for which $x_j \neq x_k$. The rest of the samples are unaffected by the permutation and keep their positions.

Note that this holds independently of the serial or parallel nature of the dimensions. However, depending on this nature, we can define three different cases: either both dimensions are parallel, or both are serial, or one of them is serial and the other one is parallel. Each of these cases leads to different shuffling circuits, as shown next.

B. Parallel–Parallel EBE

If $x_j$ and $x_k$ are parallel, then $p > j > k$. This leads to

$$\begin{aligned} P_0 &\equiv u_{n-1} \ldots u_p \,|\, u_{p-1} \ldots u_j \ldots u_k \ldots u_0 \\ P_1 &\equiv u_{n-1} \ldots u_p \,|\, u_{p-1} \ldots u_k \ldots u_j \ldots u_0 \end{aligned} \tag{13}$$

and the pairs of inputs whose position must be exchanged are

$$\begin{aligned} P_{0A} &\equiv u_{n-1} \ldots u_p \,|\, u_{p-1} \ldots 0 \ldots 1 \ldots u_0 \\ P_{0B} &\equiv u_{n-1} \ldots u_p \,|\, u_{p-1} \ldots 1 \ldots 0 \ldots u_0. \end{aligned} \tag{14}$$

As no serial dimensions are involved in the permutation, the difference in time between two inputs that must be switched is $\Delta t = t(P_{0B}) - t(P_{0A}) = 0$. Therefore, pairs of data whose position must be exchanged are received at the same time at different terminals of the shuffling structure. This means that the permutation can be carried out by rearranging the inputs at each time instant and the circuit does not need any delay element to store samples. In addition, according to (13), input data at terminal $T_0 \equiv u_{p-1} \ldots u_j \ldots u_k \ldots u_0$ are always forwarded to $T_1 \equiv u_{p-1} \ldots u_k \ldots u_j \ldots u_0$. Consequently, a parallel–parallel EBE can be simply carried out by an interconnection between each input terminal, $T_0$, and the corresponding output one, $T_1$.

C. Serial–Serial EBE

If both dimensions $x_j$ and $x_k$ are serial, then $j > k \geq p$. This leads to

$$\begin{aligned} P_0 &\equiv u_{n-1} \ldots u_j \ldots u_k \ldots u_p \,|\, u_{p-1} \ldots u_0 \\ P_1 &\equiv u_{n-1} \ldots u_k \ldots u_j \ldots u_p \,|\, u_{p-1} \ldots u_0 \end{aligned} \tag{15}$$


and the pairs of inputs whose position must be exchanged are

$$\begin{aligned} P_{0A} &\equiv u_{n-1} \ldots 0 \ldots 1 \ldots u_p \,|\, u_{p-1} \ldots u_0 \\ P_{0B} &\equiv u_{n-1} \ldots 1 \ldots 0 \ldots u_p \,|\, u_{p-1} \ldots u_0. \end{aligned} \tag{16}$$

This means that pairs of input data that must be interchanged arrive at the same input terminal, because $T(P_{0A}) = T(P_{0B})$, and they are separated a constant number of clock cycles

$$\Delta t = t(P_{0B}) - t(P_{0A}) = (2^j - 2^k)/2^p. \tag{17}$$

Fig. 4. Basic circuit for a serial–serial EBE.

Fig. 4 shows the circuit to carry out a serial–serial EBE. It consists of a buffer of length

$$L = \Delta t = (2^j - 2^k)/2^p \tag{18}$$

and two multiplexers controlled by the same control signal, $S$. The latency of the circuit is equal to the length of the buffer, i.e., $\mathrm{Lat} = L$. This is the minimum number of delays that make the circuit causal and, therefore, implementable.

The control signal of the multiplexers depends on the serial dimensions that are involved and is obtained as

$$S = \overline{x_j} \ \mathrm{OR}\ x_k. \tag{19}$$

Note that $S = 0$ only if $x_j = 1$ and $x_k = 0$, i.e., when a sample in position $P_{0B}$ is at the input of the circuit and sample in position $P_{0A} = P_{0B} - \Delta t = P_{0B} - L$ is at the output of the buffer. As $S = 0$, both samples are interchanged. Otherwise, $S = 1$ and data are not permuted.

When there exist parallel dimensions, the circuit is replicated in parallel for each input terminal. In this general case, the total number of delays is

$$D(\sigma) = 2^p \cdot L = 2^p \cdot \Delta t = 2^j - 2^k. \tag{20}$$

Permutations of serial data are used for bit reversal [9] and for the serial commutator FFT [20], [21].

D. Serial–Parallel EBE

When the dimension $x_j$ is serial and $x_k$ is parallel, then $j \geq p > k$. This leads to

$$\begin{aligned} P_0 &\equiv u_{n-1} \ldots u_j \ldots u_p \,|\, u_{p-1} \ldots u_k \ldots u_0 \\ P_1 &\equiv u_{n-1} \ldots u_k \ldots u_p \,|\, u_{p-1} \ldots u_j \ldots u_0 \end{aligned} \tag{21}$$

and the pairs of inputs whose position must be exchanged are

$$\begin{aligned} P_{0A} &\equiv u_{n-1} \ldots 0 \ldots u_p \,|\, u_{p-1} \ldots 1 \ldots u_0 \\ P_{0B} &\equiv u_{n-1} \ldots 1 \ldots u_p \,|\, u_{p-1} \ldots 0 \ldots u_0. \end{aligned} \tag{22}$$

Accordingly, pairs of input samples that must be interchanged arrive at different terminals because $T(P_{0A}) \neq T(P_{0B})$. Likewise, they arrive at different time instants at the circuit, being

$$\Delta t = t(P_{0B}) - t(P_{0A}) = 2^{j-p}. \tag{23}$$

Fig. 5. Basic circuit for a serial–parallel EBE.

In order to do the swapping, the input sample at terminal $T(P_{0A})$, which arrives first, must wait $\Delta t$ clock cycles until the other sample arrives. Fig. 5 shows the circuit that permutes a parallel dimension with a serial one. It consists of two buffers and two multiplexers, where the length of each buffer is directly determined by

$$L = \Delta t = 2^{j-p} \tag{24}$$

and the control signal is

$$S = x_j. \tag{25}$$

If there is more than one parallel dimension, the circuit in Fig. 5 is replicated in parallel $2^{p-1}$ times. This leads to a total number of delays

$$D(\sigma) = 2^{p-1} \cdot 2L = 2^j. \tag{26}$$

Examples of serial–parallel permutations can be found in the parallel feedforward FFT architectures [16], [17].

E. Implementation of the Delays Using Memories

Although the circuits have been described in terms of delays, these delays can be implemented in hardware by memories that act as buffers. In fact, several delays in the permutation circuits can be grouped together to form a bigger memory [33]. This reduces the power consumption and the area of the circuit if the number of delays is large [34].

V. COST OF A BIT-DIMENSION PERMUTATION

TABLE I. COSTS OF EBEs

The costs of the circuits in Section IV are summarized in Table I. The cost is shown in terms of a total number of delays, $D$, buffer length and latency, $L$, and multiplexers, $M$.
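The delay and buffer-length expressions of Section IV, i.e., (18), (20), (24), and (26), can be collected into a single helper. The Python sketch below is only an illustration under the conventions of the text ($x_0$ to $x_{p-1}$ parallel, $x_p$ to $x_{n-1}$ serial); the function name `ebe_cost` and the returned tuple layout are assumptions, and the multiplexer count is left out because it is not derived in the running text.

```python
def ebe_cost(j: int, k: int, p: int) -> tuple[str, int, int]:
    """Return (kind, total delays D, buffer length L) for the EBE x_j <-> x_k.

    p is the number of parallel dimensions.
    """
    if j == k:
        raise ValueError("an EBE exchanges two different dimensions")
    if j < k:
        j, k = k, j                      # enforce j > k, without loss of generality
    if j < p:
        # Both dimensions are parallel: pure rewiring, no delay elements.
        return "parallel-parallel", 0, 0
    if k >= p:
        # Both dimensions are serial: (20) for D and (18) for L.
        delays = 2**j - 2**k
        return "serial-serial", delays, delays // 2**p
    # x_j serial and x_k parallel: (26) for D and (24) for L.
    return "serial-parallel", 2**j, 2**(j - p)


# With p = 1: x1 <-> x0 and x2 <-> x0 are serial-parallel, x2 <-> x1 is serial-serial.
for j, k in [(1, 0), (2, 1), (2, 0)]:
    print((j, k), ebe_cost(j, k, p=1))
```

The integer division in the serial–serial branch is exact because $j > k \geq p$ makes $2^j - 2^k$ a multiple of $2^p$.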


For any bit-dimension permutation, the number of delays is calculated from the cost of the EBEs that it consists of, according to

$$D(\sigma) = \sum_{q=1}^{Q} D(\sigma_q) - \sum_{r} \min\bigl(D(\sigma_r), D(\sigma_{r+1})\bigr) \tag{27}$$

where $Q$ is the total number of EBEs and $r$ corresponds to the values for which $\sigma_r$ and $\sigma_{r+1}$ are both serial–parallel EBEs that share the parallel dimension. The latency of a bit-dimension permutation is related to the number of delays by the number of parallel samples $P = 2^p$, i.e., $\mathrm{Lat} = D(\sigma)/2^p$. Finally, the number of multiplexers is equal to the sum of the number of multiplexers of the individual permutations, that is

$$M(\sigma) = \sum_{q=1}^{Q} M(\sigma_q). \tag{28}$$

Example: The perfect shuffle $\sigma(u_2 u_1 | u_0) = u_1 u_0 | u_2$ can be calculated in three ways depending on how we break down the permutation, i.e., $\sigma = \sigma_1 \circ \sigma_2 = \sigma_3 \circ \sigma_1 = \sigma_2 \circ \sigma_3$, where $\sigma_1 : x_1 \leftrightarrow x_0$, $\sigma_2 : x_2 \leftrightarrow x_1$ and $\sigma_3 : x_2 \leftrightarrow x_0$ are EBEs. The permutations $\sigma_1$ and $\sigma_3$ are serial–parallel and the permutation $\sigma_2$ is serial–serial. According to Table I and considering that $p = 1$, the number of delays and multiplexers of the EBEs is

$$\begin{aligned} D(\sigma_1) &= 2^1 = 2, & M(\sigma_1) &= 2 \\ D(\sigma_2) &= 2^2 - 2^1 = 2, & M(\sigma_2) &= 4 \\ D(\sigma_3) &= 2^2 = 4, & M(\sigma_3) &= 2. \end{aligned} \tag{29}$$

Fig. 6. Alternative circuits to calculate the permutation $\sigma(u_2 u_1 | u_0) = u_1 u_0 | u_2$. (a) $\sigma = \sigma_1 \circ \sigma_2$. (b) $\sigma = \sigma_3 \circ \sigma_1$. (c) $\sigma = \sigma_2 \circ \sigma_3$.

By implementing the EBEs according to Section IV, the three circuits to calculate $\sigma$ in Fig. 6 are obtained. The number of delays for the three implementations according to (27) is

$$\begin{aligned} D(\sigma_1 \circ \sigma_2) &= D(\sigma_1) + D(\sigma_2) = 2 + 2 = 4 \\ D(\sigma_3 \circ \sigma_1) &= D(\sigma_3) + D(\sigma_1) - D(\sigma_1) = 4 + 2 - 2 = 4 \\ D(\sigma_2 \circ \sigma_3) &= D(\sigma_2) + D(\sigma_3) = 2 + 4 = 6. \end{aligned} \tag{30}$$

In the case of $\sigma_3 \circ \sigma_1$, both permutations are serial–parallel, so we subtract the minimum cost among them, which is $D(\sigma_1)$. This fact can be observed in Fig. 6(b), where two delays in the parallel branches between multiplexers can be removed, thanks to pipelining, leading to a total of four delays. The latency is then $\mathrm{Lat} = 4/2^1 = 2$ clock cycles for $\sigma_1 \circ \sigma_2$ and $\sigma_3 \circ \sigma_1$, and $\mathrm{Lat} = 6/2^1 = 3$ for $\sigma_2 \circ \sigma_3$. Finally, the number of multiplexers is

$$\begin{aligned} M(\sigma_1 \circ \sigma_2) &= M(\sigma_1) + M(\sigma_2) = 2 + 4 = 6 \\ M(\sigma_3 \circ \sigma_1) &= M(\sigma_3) + M(\sigma_1) = 2 + 2 = 4 \\ M(\sigma_2 \circ \sigma_3) &= M(\sigma_2) + M(\sigma_3) = 4 + 2 = 6. \end{aligned} \tag{31}$$

As a result, the permutation with the lowest cost is $\sigma_3 \circ \sigma_1$.

VI. THEORETICAL LIMITS

A. Minimum Latency

The latency of a permutation circuit is the difference between the time when the first input arrives to the circuit and the time when the first output is provided.

The time that a certain input is inside the permutation circuit, $t_I$, is equal to the time of departure, $t(P_1)$, minus the time of arrival, $t(P_0)$, plus the circuit latency. Note that the time of arrival/departure defined in Section III is referred to the arrival/departure of the first input–output. This gives

$$t_I = t(P_1) - t(P_0) + \mathrm{Lat} \geq 0 \tag{32}$$

where the time $t_I$ needs to be greater than or equal to zero to make the circuit causal. This leads to

$$\mathrm{Lat} \geq t(P_0) - t(P_1) \quad \forall\ \text{data}. \tag{33}$$

Therefore, the minimum latency is

$$\mathrm{Lat}_{\min} = \max\bigl(t(P_0) - t(P_1)\bigr). \tag{34}$$

Consequently, the minimum latency is set by the datum that arrives the latest with respect to the time in which it should be provided, which will force the rest of data to wait for it.

B. Minimum Number of Delays

The outputs of a circuit are provided $\mathrm{Lat}_{\min}$ clock cycles after the inputs are received. The minimum number of delays is then equal to the amount of data stored during this time

$$D_{\min} = \mathrm{Lat}_{\min} \cdot P = \max\bigl(t(P_0) - t(P_1)\bigr) \cdot P \tag{35}$$

where $P = 2^p$ is the number of parallel inputs. For serial data ($P = 1$), this lower bound was already derived in [23]. For the general case, we continue the analysis as follows.
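As a numerical aside, the bounds in (34) and (35) can be checked by exhaustive enumeration for small $n$. The Python sketch below is not part of the paper's method. It assumes that a permutation is encoded as a list `sigma` with `sigma[i]` equal to the subindex of the $u$ that occupies $x_i$ at the output, so that $x_i = u_{\sigma(i)}$ as in (1); the function names are illustrative.

```python
def arrival_time(bits: list[int], p: int) -> int:
    """Time of arrival per (6): the serial bits x_p ... x_{n-1} read as an integer."""
    return sum(b << (i - p) for i, b in enumerate(bits) if i >= p)


def min_latency_and_delays(sigma: list[int], p: int) -> tuple[int, int]:
    """Evaluate (34) and (35) by brute force over all 2^n positions."""
    n = len(sigma)
    worst = 0
    for value in range(1 << n):
        u = [(value >> i) & 1 for i in range(n)]      # input:  x_i = u_i
        x_out = [u[sigma[i]] for i in range(n)]       # output: x_i = u_sigma(i)
        worst = max(worst, arrival_time(u, p) - arrival_time(x_out, p))
    return worst, worst * (1 << p)                    # (Lat_min, D_min)


# Perfect shuffle of Section V: sigma(u2 u1 | u0) = u1 u0 | u2 with p = 1,
# encoded as sigma[0] = 2, sigma[1] = 0, sigma[2] = 1.
print(min_latency_and_delays([2, 0, 1], p=1))   # (2, 4)
```

For the perfect shuffle this gives a minimum latency of 2 and a minimum of 4 delays, which agrees with the cost of $\sigma_3 \circ \sigma_1$ in (30). The enumeration grows as $2^n$, so it is only a sanity check and not a replacement for a closed-form expression.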


By substituting (6) in (35) and taking into account that $x_i = u_i$ at the input and $x_i = u_{\sigma(i)}$ at the output according to (1), we obtain

$$D_{\min} = \max\left( \sum_{i=p}^{n-1} u_i 2^i - \sum_{i=p}^{n-1} u_{\sigma(i)} 2^i \right). \tag{36}$$

The permutation $\sigma$ is a bijection and the relationship between $i$ and $\sigma(i)$ is the same as the relationship between $\sigma^{-1}(i)$ and $i$. Therefore, by applying the variable change $i \rightarrow \sigma^{-1}(i)$ to the second summation, we obtain

$$D_{\min} = \max\left( \sum_{i=p}^{n-1} u_i 2^i - \sum_{\sigma^{-1}(i)=p}^{n-1} u_i 2^{\sigma^{-1}(i)} \right). \tag{37}$$

As $u_i \in \{0, 1\}$, then each term of the first summation will add a positive number or zero and each term of the second summation will subtract a positive number or zero. When $2^i > 2^{\sigma^{-1}(i)}$, the sum of the terms corresponding to the same $u_i$ will be positive or zero, and if $2^i < 2^{\sigma^{-1}(i)}$, the sum of the terms corresponding to the same $u_i$ will be negative. Therefore, the maximum will occur when $u_i = 1$ for $i > \sigma^{-1}(i)$ and $u_i = 0$ for $i < \sigma^{-1}(i)$. This leads to

$$D_{\min} = \sum_{\substack{i \geq p \\ i > \sigma^{-1}(i)}} 2^i - \sum_{\substack{\sigma^{-1}(i) \geq p \\ i > \sigma^{-1}(i)}} 2^{\sigma^{-1}(i)}. \tag{38}$$

Algorithm 1. Theoretical Minimum Number of Delays

This equation results in Algorithm 1, which is used to calculate the cost of the optimum permutation. Equation (39) shows how to apply Algorithm 1 to the permutation $\sigma(u_4 u_3 u_2 u_1 | u_0) = u_2 u_1 u_0 u_4 | u_3$. In this case, $p = 1$ and $\sigma^{-1}(u_4 u_3 u_2 u_1 | u_0) = u_1 u_0 u_4 u_3 | u_2$

$$\begin{array}{lrrrrr} & u_4 & u_3 & u_2 & u_1 & u_0 \\ \text{Input weight} & 16 & 8 & 4 & 2 & 0 \\ \text{Output weight} & 2 & 0 & 16 & 8 & 4 \\ \hline \text{Subtraction} & 14 & 8 & -12 & -6 & -4 \\ \text{Total} & 22 & & & & \end{array} \tag{39}$$

Here the total, 22, is the sum of the positive entries of the subtraction row, $14 + 8$.

Finally, the upper bound of (38) for a permutation of $N = 2^n$ elements and $P = 2^p$ parallel streams is

$$D_{UB} = \begin{cases} N - 2\sqrt{N} + P, & P < \sqrt{N} \text{ and } n \text{ even} \\ N - \sqrt{2N} - \sqrt{\dfrac{N}{2}} + P, & P < \sqrt{N} \text{ and } n \text{ odd} \\ N - P, & P \geq \sqrt{N}. \end{cases} \tag{40}$$

Note that the number of delays is always smaller than $N$, i.e., $D_{UB} < N$.

C. Dealing With Cycles

Permutations can be broken down into cycles. Different cycles in bit-dimension permutation do not share any dimension. As cycles are not mixed, when calculating the minimum number of delays according to Algorithm 1 each column can only correspond to one cycle. Thus, the minimum number of delays for a cycle is obtained by adding the positive values in the columns that correspond to that cycle. Likewise, the minimum number of delays in a bit-dimension permutation is obtained as the sum of the minimum number of delays of the individual cycles. The example in the Appendix clarifies all of these facts.

VII. OPTIMUM CIRCUITS FOR BIT-DIMENSION PERMUTATION

In this section, we propose a methodology to obtain optimum circuits for bit-dimension permutations. First, the permutation is broken down into cycles. Then, for each cycle, the optimum permutation is obtained, which depends on the serial or parallel nature of the dimensions involved.

A. Decomposing the Permutation Into Cycles

For the optimization purpose, permutation cycles can be treated independently. Each cycle can be one among the following three types.
1) The cycle only includes parallel dimensions.
2) The cycle only includes serial dimensions.
3) The cycle includes serial and parallel dimensions.

The case when the cycle only includes parallel dimensions is straightforward: the optimum circuit simply consists of connecting each input terminal to the corresponding output terminal, as discussed in Section IV-B. For the other two cases, Sections VII-B and VII-C show how to achieve the optimum circuit.

Also, note that the number of EBEs of each cycle is one less than the number of dimensions that the cycle involves. Therefore, if $c$ is the number of cycles in the permutation, the total number of EBEs of a permutation is

$$\#\mathrm{EBEs}(\sigma) = n - c. \tag{41}$$

B. Cycles With Only Serial Dimensions

1) Problem With Elevators: The optimization for cycles with only serial dimensions is analogous to the following


problem with elevators. Once the problem of elevators is understood, it is easy to apply it to our optimization problem.

Let us assume that a building has $n$ elevators that can move between floors $F = 0$ and $F = n - 1$. Each elevator can be in any of the $n$ floors. However, there is always one elevator at each floor.

Each elevator has a number. This number corresponds to the floor that the elevator must reach.

Elevators can move. The movement is done in pairs of elevators that change floor. This exchange is done to respect the rule of one elevator per floor. The cost of moving one elevator along several floors depends on the initial and final floors. As long as the elevator moves toward its final floor, the cost will be the same independent of the number of stops in intermediate floors. However, if an elevator gets further than its target floor, there will be an extra cost.

For this problem, we want to calculate the most efficient movements in order to make all the elevators reach their destination floors.

Solution: As pairs of elevators exchange floors, in order to move them, it is necessary that one of them moves down and the other one moves up. It is also necessary that each of them reaches the floor where the other one was.

Fig. 7. All the cases in which two elevators moving in different directions can be.

Fig. 7 shows all the cases in which two elevators moving in different directions can be. The elevator $j$ is in floor $F(j)$ and aims to reach floor $j$. The elevator $k$ is in floor $F(k)$ and aims to reach floor $k$. Without loss of generality, we consider that $j > k$. This means that the elevator $j$ moves up and the elevator $k$ moves down. In other words, it is fulfilled that $F(j) < j$ and $F(k) > k$.

Among the cases in Fig. 7, in (a), (b), and (c), $j$ cannot reach $F(k)$ without surpassing its destination. In cases (c), (e), and (m), $k$ cannot reach $F(j)$ without surpassing its destination. In case (o), a swap of the floors would only move the elevators further from their destinations. Therefore, the cases that advance to the destinations without incurring additional costs are the cases (d), (g), (h), and (n). These cases share the properties: $j \geq F(k)$, $F(k) > F(j)$, and $F(j) \geq k$. Therefore, the minimum cost is achieved as long as the movements fulfill these properties.

2) Optimizing Cycles With Only Serial Dimensions: The optimization of cycles with only serial dimensions is done by translating it to a problem of the elevators. This is possible because for serial–serial permutations, the cost of moving up (analogously down) any $u_i$ from $x_k$ to $x_j$ is the same when it is done directly, i.e., $D = 2^j - 2^k$, and when there are intermediate stops, e.g., by stopping in $h$, $j > h > k$, the cost is $D = (2^h - 2^k) + (2^j - 2^h) = 2^j - 2^k$ as before.

Once the problem has been translated into a problem with elevators, the next step is to identify allowed movements that respect the properties $j \geq F(k)$, $F(k) > F(j)$, and $F(j) \geq k$, which guarantee the minimum cost. Each of these movements leads to a new building and each of these buildings creates a branch of a tree.

Then, the process repeats for each branch of the tree and continues until all the elevators reach their destination.

In the end, each branch of the tree represents an optimum permutation. The sequence of permutations to reach one optimum is obtained by following the tree from the top to the end of any branch.

Algorithm 2. Obtaining the Optimum Permutation for Cycles With Only Serial Dimensions

If, instead of obtaining all optimum solutions, we only need one of them, Algorithm 2 obtains such permutation. This algorithm searches for feasible movements of the elevators and, when it finds one, it collects the EBE, does the swap corresponding to that EBE and continues from that point,


i.e., it does not search all the optimum cases but follows the first one that it finds.

Example: Let us consider $\sigma(u_4u_3u_2u_1u_0) = u_1u_4u_0u_2u_3$. First, it is translated into the building with elevators at the top of Fig. 8. The floor in which each elevator starts is equal to the numbers on the top of the building, which corresponds to the subindex i of $u_i$ at the output of the given permutation.

Fig. 8. Obtaining all the optimum permutations for $\sigma(u_4u_3u_2u_1u_0) = u_1u_4u_0u_2u_3$.

An allowed movement is to swap the elevators 4 and 0, which are in floors 3 and 1, respectively. Another allowed movement is to swap the elevators 4 and 1, which are in floors 3 and 2, respectively. These two cases lead to the two buildings to the sides of the top one. Then, the process repeats for each of the resulting buildings, until the tree is finished.

In the end, any branch of the tree represents an optimum movement. For instance, by going from the top to the leftmost branch, we obtain the EBEs (3 1), (1 0), (2 1), and (4 3). It can be checked that this sequence of EBEs carries out the desired permutation and its cost in terms of delays is 17, which corresponds to the theoretical minimum in Section VI.

3) Proof of Optimality: To prove optimality, note that our solution only considers the movements in Fig. 7(d), (g), (h), and (n). All these movements move elevators closer to their final floor and guarantee that the final cost is optimum, since no cost apart from the minimum is introduced. Furthermore, by following any of these movements, the final cost is the same, as it is independent of the stops in the intermediate floors. What remains to prove is that at any step of the algorithm, we can always apply at least one of these movements. Any step of the algorithm consists of one or more cycles. If a given cycle involves only two dimensions, then it will be equal to the case in Fig. 7(g). If the cycle involves more than two dimensions, then the upper part of the cycle must look like the case in Fig. 7(d) or (e). They, together with Fig. 7(g), are the only cases that can create the upper part of the cycle. The case in Fig. 7(d) is already one of the valid movements. For Fig. 7(e), if F(j) is the lowest floor in the cycle, i.e., $F(j) = F_{min}$, the elevator j will go from the lowest to the highest floor. This forces the bottom part of the cycle to be closed with Fig. 7(h), which is a valid movement. In Fig. 7(e), if $F(j) > F_{min}$, there will exist an elevator h < F(j) that comes from F(h) > F(j), which allows for reaching the floors under F(j). Otherwise, the cycle could not be closed. In this case, we can apply the movement in Fig. 7(n) to the elevators j and h. Therefore, any step of the algorithm has at least one valid movement. This guarantees that the algorithms always reach the optimum permutations.

C. Cycles With Serial and Parallel Dimensions

1) Optimizing Cycles With Serial and Parallel Dimensions: When a cycle includes serial and parallel dimensions, the circuit is optimized by using one of the parallel dimensions as a pivot. Thus, all the EBEs are carried out between the pivot dimension and another dimension. This transforms all serial–serial EBEs into serial–parallel, which follows the ideas in Section V and results in fewer multiplexers and equal or fewer delays than using serial–serial permutations.
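The two decomposition strategies, elevator movements for cycles with only serial dimensions and a parallel pivot for mixed cycles, can be illustrated with a minimal Python sketch. It is not the paper's Algorithm 2 or Algorithm 3 verbatim: the function names, the dictionary `floor` that describes the building, the search order, and the cost helpers are assumptions made for illustration. A serial–serial EBE (a b) with a > b is costed as 2^a − 2^b delays, which matches the delays quoted for the EBEs in the Appendix, and the pivot cost follows the expression in (42) as it is read here.

```python
def serial_cycle_ebes(floor):
    """Greedy elevator search for cycles with only serial dimensions.

    floor[j] is the floor F(j) where elevator j currently stands; elevator j
    wants to reach floor j. Returns the EBEs (a, b), a > b, in circuit order.
    """
    floor = dict(floor)                      # work on a copy
    dims = sorted(floor)
    ebes = []
    while any(floor[d] != d for d in dims):
        # First pair (j, k), j > k, with j >= F(k) > F(j) >= k.
        move = next(
            ((j, k) for j in reversed(dims) for k in dims
             if k < j and floor[j] < j and floor[k] > k
             and j >= floor[k] > floor[j] >= k),
            None,
        )
        if move is None:
            raise ValueError("no allowed movement found")
        j, k = move
        ebes.append((floor[k], floor[j]))    # the two floors that are exchanged
        floor[j], floor[k] = floor[k], floor[j]
    return ebes


def serial_cost(ebes):
    """Delays of serial-serial EBEs: 2**a - 2**b for each (a, b), a > b."""
    return sum(2**a - 2**b for a, b in ebes)


def pivot_cycle_ebes(dest, pivot):
    """EBEs for a cycle that contains the parallel dimension `pivot`.

    dest[i] is the output position of u_i. Each EBE is (dimension, pivot).
    """
    held = pivot                             # index i of the u_i sitting in the pivot
    ebes = []
    while dest[held] != pivot:
        target = dest[held]
        ebes.append((target, pivot))
        held = target                        # u_target is displaced into the pivot
    return ebes


def pivot_cost(ebes, p):
    """Delays of a pivot sequence; dimensions below p are parallel and weigh 0."""
    w = [2**d if d >= p else 0 for d, _ in ebes]
    return sum(w) - sum(min(a, b) for a, b in zip(w, w[1:]))
```

With the building of Fig. 8 written as `{4: 1, 3: 4, 2: 0, 1: 2, 0: 3}`, this search order returns (3 1), (4 3), (1 0), (2 1), which is a different order from the leftmost branch quoted above but has the same cost of 17 delays.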


TABLE II
COST IN TERMS OF DELAYS/MEMORY AND MULTIPLEXERS OF SEVERAL BIT-DIMENSION PERMUTATIONS

Algorithm 3 Obtaining the Optimum Permutation for Cycles With at Least One Parallel Dimension

The order of the EBEs that must be carried out is obtained easily. Starting with the pivot dimension, the value $u_i$ is moved to its corresponding place at the output. This not only allocates $u_i$ in its place but also moves another $u_{i'}$ to the pivot dimension. Next, $u_{i'}$ is moved to its corresponding place at the output, and a new $u_{i''}$ is moved to the pivot dimension. The procedure continues in the same way until all the values reach their place. Note that $i' = \sigma^{-1}(i)$ and $i'' = \sigma^{-2}(i)$ according to the definition in Section II.

The previous procedure results in Algorithm 3. An example of the application of this algorithm is given in the Appendix.

2) Proof of Optimality: All the resulting permutations are serial–parallel or parallel–parallel. By including the costs in Table I in (27), the cost of the resulting permutation is

$$D = \sum_{i \ge p} 2^{i} - \sum_{\substack{i \ge p \\ i' \ge p}} \min\left(2^{i}, 2^{i'}\right) \tag{42}$$

$$D = \sum_{\substack{i \ge p \\ i \ge i'}} 2^{i} + \sum_{\substack{i \ge p \\ i < i'}} 2^{i} - \sum_{i' > i \ge p} 2^{i} - \sum_{i > i' \ge p} 2^{i'}. \tag{43}$$

As $i' = \sigma^{-1}(i)$, this corresponds to the minimum number of delays in (38).

D. Number of Multiplexers

In cycles that only include serial dimensions, all the EBEs are serial–serial. Therefore, each serial path includes two multiplexers per EBE, leading to a total of

$$M = (s_C - 1)\,2P \tag{44}$$

where $s_C$ is the number of serial dimensions in the cycle and $s_C - 1$ is equal to the number of EBEs in the cycle.

In cycles with at least one parallel dimension, the optimum permutation consists of a sequence of serial–parallel permutations. These permutations require one multiplexer per parallel branch and per serial–parallel permutation, that is

$$M = s_C P. \tag{45}$$

Based on this, for any bit-dimension permutation, the upper bound for the number of multiplexers in the proposed approach is

$$M_{UB} = 2P \log_2\left(\frac{N}{2P}\right). \tag{46}$$

VIII. COMPARISON

Table II compares the cost of several bit-dimension permutations in terms of delays/memory and multiplexers. The first column shows the references. The second column shows if the approach is memory-based or delay-based. The remaining columns show the cost in terms of delays/memory and two-input multiplexers for the permutations (A), (B), (C), and (D) under study.

Case (A) is a P × P matrix transposition with $P = \sqrt{N}$, which corresponds to $\sigma(u_{n-1} \ldots u_{n/2} | u_{n/2-1} \ldots u_0) = u_{n/2-1} \ldots u_0 | u_{n-1} \ldots u_{n/2}$. It is implemented with n/2 serial–parallel EBEs $\sigma_i : x_i \leftrightarrow x_{i+n/2}$, $i = 0, \ldots, n/2 - 1$, and its cost is

$$D(\sigma) = \sum_{i=0}^{n/2-1} 2^{i+n/2} = N - \sqrt{N} \tag{47}$$

$$M(\sigma) = \sum_{i=0}^{n/2-1} 2^{p} = P \log_2 P \quad (\text{given that } P = \sqrt{N}). \tag{48}$$

For this permutation, the proposed approach and other delay-based approaches in Table II have less complexity than memory-based approaches either in the amount of delays/memory, or the number of multiplexers, or both.

The permutation (B) is a bit reversal of N data arriving in P parallel streams with $N > P^2$. By using the proposed approach, the bit reversal is broken down into the EBEs $\sigma_i : x_i \leftrightarrow x_{n-1-i}$, $i = 0, \ldots, \lfloor n/2 \rfloor - 1$.

For either parallel data with $N > P^2$ or serial data, the permutation consists of p serial–parallel EBEs $\sigma_i : x_i \leftrightarrow x_{n-1-i}$, $i = 0, \ldots, p - 1$, and $\lfloor n/2 \rfloor - p$ serial–serial EBEs $\sigma_i : x_i \leftrightarrow x_{n-1-i}$, $i = p, \ldots, \lfloor n/2 \rfloor - 1$. The cost of the circuit is

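$$D(\sigma) = \sum_{i=0}^{p-1} 2^{n-1-i} + \sum_{i=p}^{\lfloor n/2 \rfloor - 1} \left(2^{n-1-i} - 2^{i}\right) \tag{49}$$

which results in

$$D(\sigma) = \begin{cases} N - 2\sqrt{N} + P, & n \text{ even} \\ N - \sqrt{2N} - \sqrt{N/2} + P, & n \text{ odd} \end{cases} \tag{50}$$

and

$$M(\sigma) = p\,2^{p} + \left(\lfloor n/2 \rfloor - p\right) 2^{p+1} = \begin{cases} P \log_2\left(\dfrac{N}{P}\right), & n \text{ even} \\ P \log_2\left(\dfrac{N}{2P}\right), & n \text{ odd.} \end{cases} \tag{51}$$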
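As a numerical cross-check of (49) to (51), the following minimal Python sketch sums the cost of the individual EBEs of a bit reversal and compares the totals with the closed forms. The function names are hypothetical, and the per-EBE multiplexer counts (P for a serial–parallel EBE and 2P for a serial–serial EBE) are taken from (45) and (44) as read here; the square roots are written as integer powers of two.

```python
def bit_reversal_cost(n, p):
    """Delays and multiplexers obtained by summing the EBEs of a bit reversal.

    Assumes n > 2p (that is, N > P^2) or serial data (p = 0).
    """
    P = 2**p
    delays = muxes = 0
    for i in range(n // 2):                  # EBE x_i <-> x_{n-1-i}
        if i < p:                            # serial-parallel EBE
            delays += 2**(n - 1 - i)
            muxes += P
        else:                                # serial-serial EBE
            delays += 2**(n - 1 - i) - 2**i
            muxes += 2 * P
    return delays, muxes


def closed_form(n, p):
    """Closed forms (50) and (51), written with integer powers of two."""
    N, P = 2**n, 2**p
    if n % 2 == 0:
        # sqrt(N) = 2**(n/2) and log2(N/P) = n - p
        return N - 2 * 2**(n // 2) + P, P * (n - p)
    # sqrt(2N) = 2**((n+1)/2), sqrt(N/2) = 2**((n-1)/2), log2(N/(2P)) = n - 1 - p
    return N - 2**((n + 1) // 2) - 2**((n - 1) // 2) + P, P * (n - 1 - p)
```

For instance, `bit_reversal_cost(4, 1)` and `closed_form(4, 1)` both give 10 delays and 6 multiplexers for N = 16 and P = 2.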
The resulting circuits for serial and parallel bit reversal using
the proposed approach are the same as those in [9] and [10],
respectively. Therefore, the proposed approach is capable of
obtaining optimum circuits for bit reversal for any P, and [9]
and [10] are only specific cases of the framework provided in
this paper.
Also, note that the cost of the bit-reversal permutation, both
for serial and parallel data, corresponds to the upper bound
defined in (40). This means that the bit reversal is the most
costly bit-dimension permutation.
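For small n, this claim can be checked by brute force. The sketch below is only an illustration with hypothetical function names: it assumes that the minimum number of delays of a permutation is the sum of its positive weight differences, as in the tabular calculation (52) of the Appendix, where a serial dimension i weighs 2^i and a parallel dimension weighs 0. It then enumerates all n! bit-dimension permutations and compares the largest cost with that of the bit reversal.

```python
from itertools import permutations


def d_min(dest, p):
    """Sum of positive weight differences for one bit-dimension permutation.

    dest[i] is the output position of u_i; dimensions below p are parallel.
    """
    w = [2**i if i >= p else 0 for i in range(len(dest))]
    return sum(max(0, w[i] - w[dest[i]]) for i in range(len(dest)))


def max_over_all(n, p):
    """Largest d_min over all n! permutations (only practical for small n)."""
    return max(d_min(dest, p) for dest in permutations(range(n)))


def bit_reversal(n):
    """Destination list of the bit reversal: u_i goes to position n-1-i."""
    return [n - 1 - i for i in range(n)]
```

For n = 4 and p = 1, both `max_over_all(4, 1)` and `d_min(bit_reversal(4), 1)` evaluate to 10, in agreement with (50).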
Compared to previous approaches in Table II, the proposed approach requires less delays/memory than previous memory-based approaches. As we consider $N > P^2$, the $P \log_2 P$ multiplexers in [27] are fewer than the $P \log_2(N/P)$ multiplexers of the proposed approach. Therefore, there is a tradeoff between delays/memory and multiplexers.
The permutation (C) is $\sigma(u_4u_3u_2u_1|u_0) = u_2u_1u_0u_4|u_3$, which is a stride permutation that has been used in [8]. Fig. 9(b) and (c) show the proposed solution and the timing diagram, respectively. In this case, memory-based approaches require fewer multiplexers at the cost of noticeably more delays/memory. All delay-based approaches require the theoretical minimum amount of delays/memory, and the proposed approach requires the fewest multiplexers among them.

Fig. 9. Circuits for the permutation $\sigma(u_4u_3u_2u_1|u_0) = u_2u_1u_0u_4|u_3$. (a) Using the theory of stride permutations in [8, Fig. 14(f)]. (b) Using the proposed approach. (c) Timing diagram of the proposed approach.

The permutation (D) is $\sigma(u_4u_3u_2|u_1u_0) = u_3u_0u_1|u_4u_2$, which is not a stride permutation. The proposed solution is shown in Fig. 10. In this case, the proposed approach saves 68% of the memory and uses 50% more multiplexers with respect to those in [25] and [27] and saves 37% of the memory plus 25% of the multiplexers with respect to that in [26] and [27]. Previous delay-based approaches do not consider this permutation [8] or require a large number of multiplexers [24].

Finally, there are some general conclusions. On the one hand, the proposed approach reduces the memory requirement with respect to memory-based approaches, and in most of the cases, the reduction is significant. This is derived from Fig. 11, which shows the maximum and mean number of delays of the proposed circuits among all the permutations with the corresponding dimensions and parallelization, normalized to N. As some memory-based approaches [26]–[29] require a total memory of N, the values of the graph correspond to the ratio between the delays/memory of the proposed approach and those of these memory-based approaches.

On the other hand, the proposed approach reduces the number of multiplexers compared to previous delay-based approaches [8], [24] while having the minimum number of delays/memory. It also widens the scope, as some previous approaches [8] are restricted to strides.


Fig. 10. Circuit for the permutation $\sigma(u_4u_3u_2|u_1u_0) = u_3u_0u_1|u_4u_2$. (a) Solution using the proposed approach. (b) Input order. (c) Output order.

Fig. 11. Average and maximum number of delays of the proposed approach normalized to N as a function of the number of dimensions and the parallelization.

IX. CONCLUSION

This paper has presented a new approach to design optimum circuits for bit-dimension permutations. It consists of breaking down any permutation into EBEs in an optimum way and, then, implementing these EBEs with hardware circuits.

In order to achieve optimum results, this paper analyzes the cost of the bit-dimension permutations in terms of the number of delays. A methodology to calculate this minimum number of delays and obtain the corresponding circuit is proposed.

Comparison to previous approaches shows that the proposed approach reduces the delays/memory with respect to previous memory-based approaches and the number of multiplexers with respect to previous delay-based approaches.

APPENDIX
PRACTICAL CASE

This section illustrates the entire procedure to design the circuits for bit-dimension permutations. For this purpose, we consider the permutation $\sigma(u_7u_6u_5u_4u_3u_2u_1|u_0) = u_6u_1u_0u_3u_5u_7u_2|u_4$. This permutation has p = 1 parallel dimension and, therefore, $P = 2^p = 2$.

The permutation σ has two cycles that involve $\{x_7 x_6 x_2 x_1\}$ and $\{x_5 x_4 x_3 x_0\}$, respectively, which is shown in Fig. 12. According to (41), the circuit consists of n − c = 8 − 2 = 6 EBEs, three for each cycle. Note that the first cycle only involves serial dimensions, whereas the second one involves serial and parallel dimensions.

Fig. 12. Cycles of the permutation $\sigma(u_7u_6u_5u_4u_3u_2u_1|u_0) = u_6u_1u_0u_3u_5u_7u_2|u_4$. The two cycles involve the dimensions $\{x_7 x_6 x_2 x_1\}$ and $\{x_5 x_4 x_3 x_0\}$, respectively.

Fig. 13. Permutation $\sigma(u_7u_6u_5u_4u_3u_2u_1|u_0) = u_6u_1u_0u_3u_5u_7u_2|u_4$. (a) Problem with elevators for the cycle with only serial dimensions. (b) Final permutation circuit.

The latency and the number of delays and multiplexers can be calculated as follows. Following Algorithm 1, the number of delays is $D_{min} = 166$ according to

$$\begin{array}{rrrrrrrrr} & \mathbf{u_7} & \mathbf{u_6} & u_5 & u_4 & u_3 & \mathbf{u_2} & \mathbf{u_1} & u_0 \\ & \mathbf{128} & \mathbf{64} & 32 & 16 & 8 & \mathbf{4} & \mathbf{2} & 0 \\ - & \mathbf{4} & \mathbf{128} & 8 & 0 & 16 & \mathbf{2} & \mathbf{64} & 32 \\ \hline & \mathbf{124} & \mathbf{-64} & 24 & 16 & -8 & \mathbf{2} & \mathbf{-62} & -32 \end{array} \tag{52}$$

where the positive differences add up to 124 + 24 + 16 + 2 = 166.

From the total number of delays, the first cycle has 124 + 2 = 126 delays and the second cycle has 24 + 16 = 40. For clarity, in (52), the columns that correspond to the first cycle are highlighted.

The latency of the circuit is obtained from (35) as $Lat_{min} = D_{min}/P = 166/2 = 83$ clock cycles.

As the first cycle only involves serial dimensions, the number of multiplexers is obtained from (44), leading to $M(\sigma_1) = (s_{C1} - 1)2P = (4 - 1) \cdot 2 \cdot 2 = 12$. Likewise, the second cycle involves serial and parallel dimensions and the number of multiplexers is obtained from (45), leading to $M(\sigma_2) = s_{C2} P = 3 \cdot 2 = 6$. As a result, the total number of multiplexers is $M(\sigma) = M(\sigma_1) + M(\sigma_2) = 12 + 6 = 18$.

The next step is to calculate the EBEs of the permutation. For the first cycle, we have the exercise with elevators shown in Fig. 13(a). One solution to this problem is the sequence


of EBEs (7 6), (2 1), and (6 2), which require 64, 2, and 60 delays, leading to the expected total of 126 delays.

For the second cycle, there is only one parallel dimension, which we use as pivot dimension. According to Algorithm 3, we obtain

$$0 \rightarrow \underset{0}{5} \rightarrow \underset{0}{3} \rightarrow \underset{0}{4} \rightarrow 0 \tag{53}$$

which leads to the sequence of EBEs (5 0), (3 0), and (4 0). According to (27), this sequence of EBEs requires 32 + 8 + 16 − 8 − 8 = 40 delays, which corresponds to the expected value.

Finally, the obtained EBEs are implemented with the circuits in Section IV, leading to the circuit in Fig. 13(b). Note that as the cycles are independent, the order of the circuits that calculate the cycles can be exchanged.

REFERENCES

[1] D. Fraser, “Array permutation by index-digit permutation,” J. ACM, vol. 23, no. 2, pp. 298–309, Apr. 1976.
[2] H. S. Stone, “Parallel processing with the perfect shuffle,” IEEE Trans. Comput., vol. C-20, no. 2, pp. 153–161, Feb. 1971.
[3] M. Garrido, “Efficient hardware architectures for the computation of the FFT and other related signal processing algorithms in real time,” Ph.D. dissertation, Dept. Signal, Syst. Radiocommun., Univ. Politécnica Madrid, Madrid, Spain, Dec. 2009.
[4] I. D. Lotto and D. Dotti, “Large-matrix-ordering technique with applications to transposition,” Electron. Lett., vol. EL-9, no. 16, pp. 374–375, Aug. 1973.
[5] J. Granata, M. Conner, and R. Tolimieri, “Recursive fast algorithm and the role of the tensor product,” IEEE Trans. Signal Process., vol. 40, no. 12, pp. 2921–2930, Dec. 1992.
[6] T. Järvinen, “Systematic methods for designing stride permutation interconnections,” Ph.D. dissertation, Inst. Digit. Comput. Syst., Tampere Univ. Technol., Tampere, Finland, Nov. 2004.
[7] T. Järvinen, P. Salmela, H. Sorokin, and J. Takala, “Stride permutation networks for array processors,” in Proc. 15th IEEE Int. Conf. Appl.-Specific Syst., Architectures Processors, Sep. 2004, pp. 376–386.
[8] T. Järvinen, P. Salmela, H. Sorokin, and J. Takala, “Stride permutation networks for array processors,” J. VLSI Signal Process. Syst., vol. 49, no. 1, pp. 51–71, Oct. 2007.
[9] M. Garrido, J. Grajal, and O. Gustafsson, “Optimum circuits for bit reversal,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 10, pp. 657–661, Oct. 2011.
[10] C. Cheng and F. Yu, “An optimum architecture for continuous-flow parallel bit reversal,” IEEE Signal Process. Lett., vol. 22, no. 12, pp. 2334–2338, Dec. 2015.
[11] W. Li, F. Yu, and Z. Ma, “Efficient circuit for parallel bit reversal,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 4, pp. 381–385, Apr. 2016.
[12] C.-M. Chen, C.-C. Hung, and Y.-H. Huang, “An energy-efficient partial FFT processor for the OFDMA communication system,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 2, pp. 136–140, Feb. 2010.
[13] S.-G. Chen, S.-J. Huang, M. Garrido, and S.-J. Jou, “Continuous-flow parallel bit-reversal circuit for MDF and MDC FFT architectures,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 10, pp. 2869–2877, Oct. 2014.
[14] R. Chen and V. K. Prasanna, “Optimal circuits for parallel bit reversal,” in Proc. 54th ACM/EDAC/IEEE Des. Automat. Conf. (DAC), Jun. 2017, pp. 1–6.
[15] Y.-N. Chang, “An efficient VLSI architecture for normal I/O order pipeline FFT design,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 12, pp. 1234–1238, Dec. 2008.
[16] M. Garrido, J. Grajal, M. A. Sanchez, and O. Gustafsson, “Pipelined radix-$2^k$ feedforward FFT architectures,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 1, pp. 23–32, Jan. 2013.
[17] M. Garrido, S. J. Huang, and S. G. Chen, “Feedforward FFT hardware architectures based on rotator allocation,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 2, pp. 581–592, Feb. 2018.
[18] Y. Chen, Y. W. Lin, Y. C. Tsao, and C. Y. Lee, “A 2.4-Gsample/s DVFS FFT processor for MIMO OFDM communication systems,” IEEE J. Solid-State Circuits, vol. 43, no. 5, pp. 1260–1273, May 2008.
[19] Y. W. Lin and C. Y. Lee, “Design of an FFT/IFFT processor for MIMO OFDM systems,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 4, pp. 807–815, Apr. 2007.
[20] M. Garrido, S.-J. Huang, S.-G. Chen, and O. Gustafsson, “The serial commutator FFT,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 10, pp. 974–978, Oct. 2016.
[21] M. Garrido, N. K. Unnikrishnan, and K. K. Parhi, “A serial commutator fast Fourier transform architecture for real-valued signals,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 65, no. 11, pp. 1693–1697, Nov. 2018.
[22] D. Akopian, J. Takala, J. Saarinen, and J. Astola, “Multistage interconnection networks for parallel Viterbi decoders,” IEEE Trans. Commun., vol. 51, no. 9, pp. 1536–1545, Sep. 2003.
[23] K. K. Parhi, “Systematic synthesis of DSP data format converters using life-time analysis and forward-backward register allocation,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 7, pp. 423–440, Jul. 1992.
[24] M. Majumdar and K. K. Parhi, “Design of data format converters using two-dimensional register allocation,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 45, no. 4, pp. 504–508, Apr. 1998.
[25] M. Püschel, P. A. Milder, and J. C. Hoe, “Permuting streaming data using RAMs,” J. ACM, vol. 56, no. 2, Apr. 2009, Art. no. 10.
[26] R. Chen and V. K. Prasanna, “Automatic generation of high throughput energy efficient streaming architectures for arbitrary fixed permutations,” in Proc. 25th Int. Conf. Field Program. Logic Appl. (FPL), Sep. 2015, pp. 1–8.
[27] F. Serre, T. Holenstein, and M. Püschel, “Optimal circuits for streamed linear permutations using RAM,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2016, pp. 215–223.
[28] J. H. Takala, T. S. Järvinen, and H. T. Sorokin, “Conflict-free parallel memory access scheme for FFT processors,” in Proc. Int. Symp. Circuits Syst., vol. 4, May 2003, pp. IV-524–IV-527.
[29] J. Takala and T. Järvinen, Stride Permutation Access in Interleaved Memory Systems, (Domain-Specific Processors: Systems, Architectures, Modeling, and Simulation), S. Bhattacharyya, E. Deprettere, and J. Teich, Eds. Boca Raton, FL, USA: CRC Press, 2003.
[30] T. Koehn and P. Athanas, “Arbitrary streaming permutations with minimum memory and latency,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2016, pp. 1–6.
[31] T. E. Koehn, “Automatic generation of efficient parallel streaming structures for hardware implementation,” Ph.D. dissertation, Dept. Elect. Eng., Virginia Polytech. Inst., Blacksburg, VA, USA, Nov. 2016.
[32] A. Edelman, S. Heller, and S. L. Johnsson, “Index transformation algorithms in a linear algebra framework,” IEEE Trans. Parallel Distrib. Syst., vol. 5, no. 12, pp. 1302–1309, Dec. 1994.
[33] M. Garrido, M. Acevedo, A. Ehliar, and O. Gustafsson, “Challenging the limits of FFT performance on FPGAs,” in Proc. Int. Symp. Integr. Circuits (ISIC), Dec. 2014, pp. 172–175.
[34] T. Ahmed, M. Garrido, and O. Gustafsson, “A 512-point 8-parallel pipelined feedforward FFT for WPAN,” in Proc. 45th Asilomar Conf. Signals, Syst. Comput. (ASILOMAR), Nov. 2011, pp. 981–984.

Mario Garrido (M’07) received the M.S. degree in electrical engineering and the Ph.D. degree from the Technical University of Madrid, Madrid, Spain, in 2004 and 2009, respectively.
In 2010, he joined the Department of Electrical Engineering, Linköping University, Linköping, Sweden, as a Postdoctoral Researcher, where he has been an Associate Professor since 2012. His current research interests include optimized hardware design for signal processing applications, particularly the design of hardware architectures for the calculation of transforms, such as the fast Fourier transform, circuits for data management, the CORDIC algorithm, and circuits to calculate statistical and mathematical operations, and high-performance circuits for real-time computation, and designs for small area and low power consumption.


Jesús Grajal (M’17) was born in León, Spain, in 1967. He received the Ingeniero de Telecomunicación and Ph.D. degrees from the Technical University of Madrid, Madrid, Spain, in 1992 and 1998, respectively.
He is currently a Professor at the Signals, Systems, and Radio Communications Department, Technical School of Telecommunication Engineering, Technical University of Madrid. His current research interests include hardware-design for radar systems, radar signal processing and broadband digital receivers for radar, and spectrum surveillance applications.

Oscar Gustafsson (S’98–M’03–SM’10) received the M.Sc., Ph.D., and Docent degrees from Linköping University, Linköping, Sweden, in 1998, 2003, and 2008, respectively.
He is currently an Associate Professor and the Head of the Computer Engineering Division, Department of Electrical Engineering, Linköping University. His current research interests include the design and implementation of DSP algorithms and arithmetic circuits. He has authored and coauthored more than 140 papers in international journals and conferences on these topics.
Dr. Gustafsson is a member of the VLSI Systems and Applications and the Digital Signal Processing Technical Committees of the IEEE Circuits and Systems Society. He has served and serves in various positions for conferences such as ISCAS, PATMOS, PrimeAsia, Asilomar, Norchip, ECCTD, and ICECS. He currently serves as an Associate Editor for the IEEE Transactions on Circuits and Systems II: Express Briefs and Integration, the VLSI Journal.
