Authorized licensed use limited to: INDIAN INSTITUTE OF INFORMATION TECHNOLOGY. Downloaded on September 07,2020 at 15:24:47 UTC from IEEE Xplore. Restrictions apply.
GARRIDO et al.: OPTIMUM CIRCUITS FOR BIT-DIMENSION PERMUTATIONS 1149
II. BIT-DIMENSION PERMUTATIONS

Let us consider a set of $N = 2^n$ data, $n \in \mathbb{N}$, in an $n$-dimensional space $x_{n-1} x_{n-2} \ldots x_0$, where $x_i \in \{0, 1\}$. In this context, a bit-dimension permutation, $\sigma$, defines a reordering of the data according to a permutation of the coordinates in the space [1]. This allows for defining the permutation on a set of $n$ elements instead of defining it for $2^n$ values, which is, most of the time, mathematically inaccessible [32].

In general, a bit-dimension permutation is a permutation

Fig. 2. Definition of serial and parallel dimensions in a data flow.

Fig. 2 shows the definition of serial and parallel dimensions in a data flow. As a convention, data flow from left to right, $x_0$ to $x_{p-1}$ are parallel dimensions, and $x_p$ to $x_{n-1}$ are serial ones. This means that there are $p$ parallel dimensions and $n - p$ serial dimensions. This also means that the data flow is modeled as a rectangle of $2^p$ data in parallel times $2^{n-p}$ in series.
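As an illustration of how a permutation of the $n$ coordinates reorders all $2^n$ data, the following minimal Python sketch applies a bit-dimension permutation to a list. It assumes the convention used later in the paper, $x_i = u_i$ at the input and $x_i = u_{\sigma(i)}$ at the output, and represents $\sigma$ as a list `sigma` with `sigma[i]` $= \sigma(i)$. The function names `permute_index`, `apply_permutation`, and `split_position` are hypothetical and chosen only for this example.

```python
def permute_index(index, sigma):
    """Output position of the datum whose input index has bits u_{n-1}..u_0.

    Bit i of the output position is x_i = u_{sigma[i]} (assumed convention).
    """
    position = 0
    for i, source_bit in enumerate(sigma):
        position |= ((index >> source_bit) & 1) << i
    return position


def apply_permutation(data, sigma):
    """Reorder 2**n data by permuting the n coordinates of their indices."""
    assert len(data) == 1 << len(sigma)
    out = [None] * len(data)
    for index, value in enumerate(data):
        out[permute_index(index, sigma)] = value
    return out


def split_position(position, p):
    """Split a position into (parallel terminal, serial time slot).

    x_0..x_{p-1} are the parallel dimensions, x_p..x_{n-1} the serial ones.
    """
    return position & ((1 << p) - 1), position >> p


# Perfect shuffle sigma(u2 u1 | u0) = u1 u0 | u2, i.e. x2 = u1, x1 = u0, x0 = u2.
perfect_shuffle = [2, 0, 1]  # sigma[0], sigma[1], sigma[2]
shuffled = apply_permutation(list(range(8)), perfect_shuffle)
print(shuffled)
print(split_position(3, p=1))
```

Only the three entries of `perfect_shuffle` are needed to describe the reordering of all eight data, which is the point made above about defining the permutation on $n$ elements rather than on $2^n$ values.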
1150 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 27, NO. 5, MAY 2019
Permutations of serial data are used for bit reversal [9] and for the serial commutator FFT [20], [21].

Examples of serial–parallel permutations can be found in the parallel feedforward FFT architectures [16], [17].
For any bit-dimension permutation, the number of delays is calculated from the cost of the EBEs that it consists of, according to

$$D(\sigma) = \sum_{q=1}^{Q} D(\sigma_q) - \sum_{r} \min\left(D(\sigma_r), D(\sigma_{r+1})\right) \tag{27}$$

where $Q$ is the total number of EBEs and $r$ corresponds to the values for which $\sigma_r$ and $\sigma_{r+1}$ are both serial–parallel EBEs that share the parallel dimension. The latency of a bit-dimension permutation is related to the number of delays by the number of parallel samples $P = 2^p$, i.e., $\mathrm{Lat} = D(\sigma)/2^p$. Finally, the number of multiplexers is equal to the sum of the number of multiplexers of the individual permutations, that is

$$M(\sigma) = \sum_{q=1}^{Q} M(\sigma_q). \tag{28}$$

Example: The perfect shuffle $\sigma(u_2 u_1 | u_0) = u_1 u_0 | u_2$ can be calculated in three ways depending on how we break down the permutation, i.e., $\sigma = \sigma_1 \circ \sigma_2 = \sigma_3 \circ \sigma_1 = \sigma_2 \circ \sigma_3$, where $\sigma_1: x_1 \leftrightarrow x_0$, $\sigma_2: x_2 \leftrightarrow x_1$, and $\sigma_3: x_2 \leftrightarrow x_0$ are EBEs. The permutations $\sigma_1$ and $\sigma_3$ are serial–parallel and the permutation $\sigma_2$ is serial–serial. According to Table I and considering that $p = 1$, the number of delays and multiplexers of the EBEs is

$$\begin{aligned}
D(\sigma_1) &= 2^1 = 2, & M(\sigma_1) &= 2 \\
D(\sigma_2) &= 2^2 - 2^1 = 2, & M(\sigma_2) &= 4 \\
D(\sigma_3) &= 2^2 = 4, & M(\sigma_3) &= 2.
\end{aligned} \tag{29}$$

By implementing the EBEs according to Section IV, the three circuits to calculate $\sigma$ in Fig. 6 are obtained. The number of delays for the three implementations according to (27) is

$$\begin{aligned}
D(\sigma_1 \circ \sigma_2) &= D(\sigma_1) + D(\sigma_2) = 2 + 2 = 4 \\
D(\sigma_3 \circ \sigma_1) &= D(\sigma_3) + D(\sigma_1) - \min\left(D(\sigma_3), D(\sigma_1)\right) = 4 + 2 - 2 = 4 \\
D(\sigma_2 \circ \sigma_3) &= D(\sigma_2) + D(\sigma_3) = 2 + 4 = 6.
\end{aligned} \tag{30}$$

The latency is $\mathrm{Lat} = 4/2^1 = 2$ for $\sigma_1 \circ \sigma_2$ and $\sigma_3 \circ \sigma_1$, and $\mathrm{Lat} = 6/2^1 = 3$ for $\sigma_2 \circ \sigma_3$. Finally, the number of multiplexers is

$$\begin{aligned}
M(\sigma_1 \circ \sigma_2) &= M(\sigma_1) + M(\sigma_2) = 2 + 4 = 6 \\
M(\sigma_3 \circ \sigma_1) &= M(\sigma_3) + M(\sigma_1) = 2 + 2 = 4 \\
M(\sigma_2 \circ \sigma_3) &= M(\sigma_2) + M(\sigma_3) = 4 + 2 = 6.
\end{aligned} \tag{31}$$

As a result, the permutation with the lowest cost is $\sigma_3 \circ \sigma_1$.

VI. THEORETICAL LIMITS

A. Minimum Latency

The latency of a permutation circuit is the difference between the time when the first input arrives at the circuit and the time when the first output is provided.

The time that a certain input is inside the permutation circuit, $t_I$, is equal to the time of departure, $t(P_1)$, minus the time of arrival, $t(P_0)$, plus the circuit latency. Note that the time of arrival/departure defined in Section III refers to the arrival/departure of the first input–output. This gives

$$t_I = t(P_1) - t(P_0) + \mathrm{Lat} \geq 0 \tag{32}$$

where the time $t_I$ needs to be greater than or equal to zero to make the circuit causal. This leads to

$$\mathrm{Lat} \geq t(P_0) - t(P_1) \quad \forall\ \text{data}. \tag{33}$$

Therefore, the minimum latency is

$$\mathrm{Lat}_{\min} = \max\left(t(P_0) - t(P_1)\right). \tag{34}$$

Consequently, the minimum latency is set by the datum that arrives the latest with respect to the time in which it should be provided, which will force the rest of the data to wait for it.
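The cost model in (27) and (28) can be sketched in a few lines of Python to reproduce the perfect-shuffle example. The per-EBE costs below are an assumption inferred from the worked values in the paper rather than a transcription of Table I: a serial–parallel EBE with serial dimension $x_j$ is taken to cost $2^j$ delays and $P$ multiplexers, a serial–serial EBE $x_j \leftrightarrow x_k$ is taken to cost $2^j - 2^k$ delays and $2P$ multiplexers, and a parallel–parallel EBE is assumed to need neither. The helper names are hypothetical.

```python
def kind(ebe, p):
    """Classify an EBE (j, k): 'pp', 'sp' or 'ss' for p parallel dimensions."""
    hi, lo = max(ebe), min(ebe)
    if hi < p:
        return "pp"
    return "sp" if lo < p else "ss"


def delays(ebe, p):
    """Delays of a single EBE (assumed costs, see the lead-in)."""
    hi, lo = max(ebe), min(ebe)
    return {"pp": 0, "sp": 2 ** hi, "ss": 2 ** hi - 2 ** lo}[kind(ebe, p)]


def multiplexers(ebe, p):
    """Multiplexers of a single EBE (assumed costs, see the lead-in)."""
    return {"pp": 0, "sp": 2 ** p, "ss": 2 * 2 ** p}[kind(ebe, p)]


def composition_cost(ebes, p):
    """Return (delays, multiplexers) of a sequence of EBEs, as in (27), (28)."""
    d = [delays(e, p) for e in ebes]
    total = sum(d)
    for r in range(len(ebes) - 1):
        both_sp = kind(ebes[r], p) == kind(ebes[r + 1], p) == "sp"
        # Consecutive serial-parallel EBEs sharing the parallel dimension
        # save min(D_r, D_{r+1}) delays.
        if both_sp and min(ebes[r]) == min(ebes[r + 1]):
            total -= min(d[r], d[r + 1])
    return total, sum(multiplexers(e, p) for e in ebes)


# Three decompositions of the perfect shuffle with p = 1.
decompositions = {
    "sigma1 o sigma2": [(1, 0), (2, 1)],
    "sigma3 o sigma1": [(2, 0), (1, 0)],
    "sigma2 o sigma3": [(2, 1), (2, 0)],
}
for name, ebes in decompositions.items():
    d, m = composition_cost(ebes, p=1)
    print(name, "delays:", d, "latency:", d // 2, "multiplexers:", m)
```

Under these assumptions the sketch gives the multiplexer counts in (31) and a latency of 3 for $\sigma_2 \circ \sigma_3$, and it singles out $\sigma_3 \circ \sigma_1$ as the cheapest decomposition.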
Algorithm 1 Theoretical Minimum Number of Delays

By substituting (6) in (35) and taking into account that $x_i = u_i$ at the input and $x_i = u_{\sigma(i)}$ at the output according to (1), we obtain

$$D_{\min} = \max\left(\sum_{i=p}^{n-1} u_i 2^i - \sum_{i=p}^{n-1} u_{\sigma(i)} 2^i\right). \tag{36}$$
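For small $n$, the maximum in (36) can be evaluated by brute force over all $2^n$ data. The Python sketch below does exactly that; it is not the paper's Algorithm 1, only a direct evaluation of the formula. As an assumption of this example, $\sigma$ is given as a list `sigma` with `sigma[i]` $= \sigma(i)$, and the function name `minimum_delays` is hypothetical.

```python
from itertools import product


def minimum_delays(sigma, p):
    """Evaluate (36) by enumerating every datum (u_{n-1}, ..., u_0)."""
    n = len(sigma)
    best = 0
    for u in product((0, 1), repeat=n):  # u[i] is the bit u_i
        t_in = sum(u[i] << i for i in range(p, n))          # sum of u_i 2^i
        t_out = sum(u[sigma[i]] << i for i in range(p, n))  # sum of u_sigma(i) 2^i
        best = max(best, t_in - t_out)
    return best


# Perfect shuffle sigma(u2 u1 | u0) = u1 u0 | u2 with p = 1.
print(minimum_delays([2, 0, 1], p=1))

# Serial permutation sigma(u4 u3 u2 u1 u0) = u1 u4 u0 u2 u3 with p = 0.
print(minimum_delays([3, 2, 0, 4, 1], p=0))
```

The enumeration grows as $2^n$, so this is useful only as a cross-check of small cases such as the two examples above.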
Fig. 7. All the cases in which two elevators moving in different directions can be.

problem with elevators. Once the problem of elevators is understood, it is easy to apply it to our optimization problem.

Let us assume that a building has $n$ elevators that can move between floors $F = 0$ and $F = n - 1$. Each elevator can be on any of the $n$ floors. However, there is always one elevator at each floor.

Each elevator has a number. This number corresponds to the floor that the elevator must reach.

Elevators can move. The movement is done in pairs of elevators that change floor. This exchange is done to respect the rule of one elevator per floor. The cost of moving one elevator along several floors depends on the initial and final floors. As long as the elevator moves toward its final floor, the cost will be the same independent of the number of stops at intermediate floors. However, if an elevator moves past its target floor, there will be an extra cost.

For this problem, we want to calculate the most efficient movements in order to make all the elevators reach their destination floors.

Solution: As pairs of elevators exchange floors, in order to move them, it is necessary that one of them moves down and the other one moves up. It is also necessary that each of them reaches the floor where the other one was.

Fig. 7 shows all the cases in which two elevators moving in different directions can be. The elevator $j$ is in floor $F(j)$ and aims to reach floor $j$. The elevator $k$ is in floor $F(k)$ and aims to reach floor $k$. Without loss of generality, we consider that $j > k$. This means that the elevator $j$ moves up and the elevator $k$ moves down. In other words, it is fulfilled that $F(j) < j$ and $F(k) > k$.

Among the cases in Fig. 7, in (a), (b), and (c), $j$ cannot reach $F(k)$ without surpassing its destination. In cases (c), (e), and (m), $k$ cannot reach $F(j)$ without surpassing its destination. In case (o), a swap of the floors would only move the elevators further from their destinations. Therefore, the cases that advance to the destinations without incurring additional costs are the cases (d), (g), (h), and (n). These cases share the properties: $j \geq F(k)$, $F(k) > F(j)$, and $F(j) \geq k$. Therefore, the minimum cost is achieved as long as the movements fulfill these properties.

Algorithm 2 Obtaining the Optimum Permutation for Cycles With Only Serial Dimensions

2) Optimizing Cycles With Only Serial Dimensions: The optimization of cycles with only serial dimensions is done by translating it to a problem of the elevators. This is possible because for serial–serial permutations, the cost of moving up (analogously down) any $u_i$ from $x_k$ to $x_j$ is the same when it is done directly, i.e., $D = 2^j - 2^k$, and when there are intermediate stops, e.g., by stopping in $h$, $j > h > k$, the cost is $D = (2^h - 2^k) + (2^j - 2^h) = 2^j - 2^k$ as before.

Once the problem has been translated into a problem with elevators, the next step is to identify allowed movements that respect the properties $j \geq F(k)$, $F(k) > F(j)$, and $F(j) \geq k$, which guarantee the minimum cost. Each of these movements leads to a new building and each of these buildings creates a branch of a tree.

Then, the process repeats for each branch of the tree and continues until all the elevators reach their destination.

In the end, each branch of the tree represents an optimum permutation. The sequence of permutations to reach one optimum is obtained by following the tree from the top to the end of any branch.

If, instead of obtaining all optimum solutions, we only need one of them, Algorithm 2 obtains such permutation. This algorithm searches for feasible movements of the elevators and, when it finds one, it collects the EBE, does the swap corresponding to that EBE and continues from that point,
i.e., it does not search all the optimum cases but follows the first that it finds.

Example: Let us consider $\sigma(u_4 u_3 u_2 u_1 u_0) = u_1 u_4 u_0 u_2 u_3$. First, it is translated into the building with elevators at the top of Fig. 8. The floor in which each elevator starts is equal to the numbers on the top of the building, which corresponds to the subindex $i$ of $u_i$ at the output of the given permutation.

An allowed movement is to swap the elevators 4 and 0, which are in floors 3 and 1, respectively. Another allowed movement is to swap the elevators 4 and 1, which are in floors 3 and 2, respectively. These two cases lead to the two buildings to the sides of the top one. Then, the process repeats for each of the resulting buildings, until the tree is finished.

In the end, any branch of the tree represents an optimum movement. For instance, by going from the top to the leftmost branch, we obtain the EBEs (3 1), (1 0), (2 1), and (4 3). It can be checked that this sequence of EBEs carries out the desired permutation and its cost in terms of delays is 17, which corresponds to the theoretical minimum in Section VI.

3) Proof of Optimality: To prove optimality, we know that in our solution, we only consider the movements in Fig. 7(d), (g), (h), and (n). All these movements move elevators closer to their final floor and guarantee that the final cost is optimum since no cost apart from the minimum is introduced. Furthermore, by following any of these movements, the final cost is the same, as it is independent of the stops at intermediate floors. What remains to prove is that at any step of the algorithm, we can always apply at least one of these movements. Any step of the algorithm consists of one or more cycles. If a given cycle involves only two dimensions, then it will be equal to the case in Fig. 7(g). If the cycle involves more than two dimensions, then the upper part of the cycle must look like the case in Fig. 7(d) or (e). They, together with Fig. 7(g), are the only cases that can create the upper part of the cycle. The case in Fig. 7(d) is already one of the valid movements. For Fig. 7(e), if $F(j)$ is the lowest floor in the cycle, i.e., $F(j) = F_{\min}$, the elevator $j$ will go from the lowest to the highest floor. This forces the bottom part of the cycle to be closed with Fig. 7(h), which is a valid movement. In Fig. 7(e), if $F(j) > F_{\min}$, there will exist an elevator $h < F(j)$ that comes from $F(h) > F(j)$, which allows for reaching the floors under $F(j)$. Otherwise, the cycle could not be closed. In this case, we can apply the movement in Fig. 7(n) to the elevators $j$ and $h$. Therefore, any step of the algorithm has at least one valid movement. This guarantees that the algorithms always reach the optimum permutations.

C. Cycles With Serial and Parallel Dimensions

1) Optimizing Cycles With Serial and Parallel Dimensions: When a cycle includes serial and parallel dimensions, the circuit is optimized by using one of the parallel dimensions as a pivot. Thus, all the EBEs are carried out between the pivot dimension and another dimension. This transforms all serial–serial EBEs into serial–parallel, which follows the ideas in Section V and results in fewer multiplexers and an equal or smaller number of delays than using serial–serial permutations.
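Returning to cycles with only serial dimensions, the elevator search that follows the first feasible movement can be sketched in Python as below. This is a minimal sketch and not the listing of Algorithm 2. It assumes that $\sigma$ is given as a list `sigma` with `sigma[i]` $= \sigma(i)$, so that the elevator with destination floor `e` starts at floor `sigma[e]`, and that pairs are scanned in increasing order of `j` and `k`, which is an arbitrary choice. The names `serial_cycle_ebes` and `serial_cost` are hypothetical.

```python
def serial_cycle_ebes(sigma):
    """Collect serial-serial EBEs by following the first allowed movement.

    floor[e] is the current floor F(e) of the elevator whose destination is e.
    A swap of elevators j > k is allowed when
    j >= F(k), F(k) > F(j) and F(j) >= k.
    """
    floor = list(sigma)
    n = len(floor)
    ebes = []
    while any(floor[e] != e for e in range(n)):
        # First pair (j, k), j > k, that fulfils the three properties.
        j, k = next(
            (j, k)
            for j in range(n)
            for k in range(j)
            if j >= floor[k] > floor[j] >= k
        )
        ebes.append((floor[k], floor[j]))  # EBE between the two floors
        floor[j], floor[k] = floor[k], floor[j]
    return ebes


def serial_cost(ebes):
    """Delays of a sequence of serial-serial EBEs: sum of 2^j - 2^k."""
    return sum(2 ** max(e) - 2 ** min(e) for e in ebes)


# sigma(u4 u3 u2 u1 u0) = u1 u4 u0 u2 u3
sigma = [3, 2, 0, 4, 1]  # sigma[0], ..., sigma[4]
ebes = serial_cycle_ebes(sigma)
print(ebes, serial_cost(ebes))
```

With this scanning order the sketch returns the EBEs (3 1), (1 0), (2 1), and (4 3) of the example, with a cost of 17 delays. A different scanning order would follow another branch of the tree, which by the proof of optimality has the same cost.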
TABLE II
COST IN TERMS OF DELAYS/MEMORY AND MULTIPLEXERS OF SEVERAL BIT-DIMENSION PERMUTATIONS

Algorithm 3 Obtaining the Optimum Permutation for Cycles With at Least One Parallel Dimension

The order of the EBEs that must be carried out is obtained easily. Starting with the pivot dimension, the value $u_i$ is moved to its corresponding place at the output. This not only allocates $u_i$ in its place but also moves another $u_{i'}$ to the pivot dimension. Next, $u_{i'}$ is moved to its corresponding place at the output, and a new $u_{i''}$ is moved to the pivot dimension. The procedure continues in the same way until all the values reach their place. Note that $i' = \sigma^{-1}(i)$ and $i'' = \sigma^{-2}(i)$ according to the definition in Section II.

The previous procedure results in Algorithm 3. An example of the application of this algorithm is given in the Appendix.

2) Proof of Optimality: All the resulting permutations are serial–parallel or parallel–parallel. By including the costs in Table I in (27), the cost of the resulting permutation is

$$D = \sum_{i \geq p} 2^{i} - \sum_{\substack{i \geq p \\ i' \geq p}} \min\left(2^{i}, 2^{i'}\right) \tag{42}$$

$$D = \sum_{\substack{i \geq p \\ i \geq i'}} 2^{i} + \sum_{\substack{i \geq p \\ i < i'}} 2^{i} - \sum_{i' > i \geq p} 2^{i} - \sum_{i > i' \geq p} 2^{i'}. \tag{43}$$

As $i' = \sigma^{-1}(i)$, this corresponds to the minimum number of delays in (38).

D. Number of Multiplexers

In cycles that only include serial dimensions, all the EBEs are serial–serial. Therefore, each serial path includes two multiplexers per EBE, leading to a total of

$$M = (s_C - 1)\,2P \tag{44}$$

where $s_C$ is the number of serial dimensions in the cycle and $s_C - 1$ is equal to the number of EBEs in the cycle.

In cycles with at least one parallel dimension, the optimum permutation consists of a sequence of serial–parallel permutations. These permutations require one multiplexer per parallel branch and per serial–parallel permutation, that is

$$M = s_C P. \tag{45}$$

Based on this, for any bit-dimension permutation, the upper bound for the number of multiplexers in the proposed approach is

$$M_{\mathrm{UB}} = 2P \log_2 \frac{N}{2P}. \tag{46}$$

VIII. COMPARISON

Table II compares the cost of several bit-dimension permutations in terms of delays/memory and multiplexers. The first column shows the references. The second column shows if the approach is memory-based or delay-based. The remaining columns show the cost in terms of delays/memory and two-input multiplexers for the permutations (A), (B), (C), and (D) under study.

Case (A) is a $P \times P$ matrix transposition with $P = \sqrt{N}$, which corresponds to $\sigma(u_{n-1} \ldots u_{n/2} | u_{n/2-1} \ldots u_0) = u_{n/2-1} \ldots u_0 | u_{n-1} \ldots u_{n/2}$. It is implemented with $n/2$ serial–parallel EBEs $\sigma: x_i \leftrightarrow x_{i+n/2}$, $i = 0, \ldots, n/2 - 1$, and its cost is

$$D(\sigma) = \sum_{i=0}^{n/2-1} 2^{i+n/2} = N - \sqrt{N} \tag{47}$$

$$M(\sigma) = \sum_{i=0}^{n/2-1} 2^{p} = P \log_2 P \quad (\text{given that } P = \sqrt{N}). \tag{48}$$

For this permutation, the proposed approach and other delay-based approaches in Table II have less complexity than memory-based approaches either in the amount of delays/memory, or the number of multiplexers, or both.

The permutation (B) is a bit reversal of $N$ data arriving in $P$ parallel streams with $N > P^2$. By using the proposed approach, the bit reversal is broken down into the EBEs $\sigma_i: x_i \leftrightarrow x_{n-1-i}$, $i = 0, \ldots, n/2 - 1$.

For either parallel data with $N > P^2$ or serial data, the permutation consists of $p$ serial–parallel EBEs $\sigma_i: x_i \leftrightarrow$
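which results in

$$D(\sigma) = \begin{cases} N - 2\sqrt{N} + P, & n \text{ even} \\ N - \sqrt{2N} - \sqrt{\dfrac{N}{2}} + P, & n \text{ odd} \end{cases} \tag{50}$$

and

$$M(\sigma) = p\,2^{p} + \left(\lfloor n/2 \rfloor - p\right) 2^{p+1} = \begin{cases} P \log_2 \dfrac{N}{P}, & n \text{ even} \\ P \log_2 \dfrac{N}{2P}, & n \text{ odd.} \end{cases} \tag{51}$$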
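The closed forms (50) and (51) can be cross-checked numerically by summing the cost of the individual EBEs $\sigma_i: x_i \leftrightarrow x_{n-1-i}$. The Python sketch below does this under the same assumed per-EBE costs used in the examples of the paper (serial–parallel: $2^j$ delays and $P$ multiplexers for serial dimension $x_j$; serial–serial: $2^j - 2^k$ delays and $2P$ multiplexers). The function names are hypothetical, and the powers of two are written with integer exponents to avoid floating-point square roots.

```python
def bit_reversal_cost(n, p):
    """Sum delays and multiplexers over the EBEs x_i <-> x_{n-1-i}.

    Assumes N > P^2 (n > 2p) or serial data (p = 0), so that x_{n-1-i}
    is a serial dimension whenever x_i is a parallel one.
    """
    P = 2 ** p
    delays = muxes = 0
    for i in range(n // 2):
        j = n - 1 - i
        if i < p:  # serial-parallel EBE
            delays += 2 ** j
            muxes += P
        else:      # serial-serial EBE
            delays += 2 ** j - 2 ** i
            muxes += 2 * P
    return delays, muxes


def bit_reversal_closed_form(n, p):
    """Closed forms (50) and (51) written with integer powers of two."""
    N, P = 2 ** n, 2 ** p
    if n % 2 == 0:
        # sqrt(N) = 2^(n/2), log2(N/P) = n - p
        return N - 2 * 2 ** (n // 2) + P, P * (n - p)
    # sqrt(2N) = 2^((n+1)/2), sqrt(N/2) = 2^((n-1)/2), log2(N/(2P)) = n - p - 1
    return N - 2 ** ((n + 1) // 2) - 2 ** ((n - 1) // 2) + P, P * (n - p - 1)


for n, p in [(4, 1), (5, 1), (5, 0), (8, 2)]:
    print(n, p, bit_reversal_cost(n, p), bit_reversal_closed_form(n, p))
```

No term of (27) is subtracted here because the serial–parallel EBEs of the bit reversal each use a different parallel dimension.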
The resulting circuits for serial and parallel bit reversal using
the proposed approach are the same as those in [9] and [10],
respectively. Therefore, the proposed approach is capable of
obtaining optimum circuits for bit reversal for any P, and [9]
and [10] are only specific cases of the framework provided in
this paper.
Also, note that the cost of the bit-reversal permutation, both
for serial and parallel data, corresponds to the upper bound
defined in (40). This means that the bit reversal is the most
costly bit-dimension permutation.
Compared to previous approaches in Table II, the proposed approach requires fewer delays/memory than previous memory-based approaches. As we consider $N > P^2$, the $P \log_2 P$ multiplexers in [27] are fewer than the $P \log_2 (N/P)$ multiplexers of the proposed approach. Therefore, there is a tradeoff between delays/memory and multiplexers.
The permutation (C) is $\sigma(u_4 u_3 u_2 u_1 | u_0) = u_2 u_1 u_0 u_4 | u_3$, which is a stride permutation that has been used in [8]. Fig. 9(b) and (c) shows the proposed solution and the timing diagram, respectively. In this case, memory-based approaches require fewer multiplexers at the cost of noticeably more delays/memory. All delay-based approaches require the theoretical minimum amount of delays/memory, and the proposed approach requires the smallest number of multiplexers among them.

Fig. 9. Circuits for the permutation $\sigma(u_4 u_3 u_2 u_1 | u_0) = u_2 u_1 u_0 u_4 | u_3$. (a) Using the theory of stride permutations in [8, Fig. 14(f)]. (b) Using the proposed approach. (c) Timing diagram of the proposed approach.

The permutation (D) is $\sigma(u_4 u_3 u_2 | u_1 u_0) = u_3 u_0 u_1 | u_4 u_2$, which is not a stride permutation. The proposed solution is shown in Fig. 10. In this case, the proposed approach saves 68% of the memory and uses 50% more multiplexers with respect to those in [25] and [27] and saves 37% of the memory plus 25% of the multiplexers with respect to that in [26] and [27]. Previous delay-based approaches do not consider this permutation [8] or require a large number of multiplexers [24].

Finally, there are some general conclusions. On the one hand, the proposed approach reduces the memory requirement with respect to memory-based approaches, and in most of the cases, the reduction is significant. This is derived from Fig. 11, which shows the maximum and mean number of delays of the proposed circuits among all the permutations with the corresponding dimensions and parallelization, normalized to $N$. As some memory-based approaches [26]–[29] require a total memory of $N$, the values of the graph correspond to the ratio between the delays/memory of the proposed approach and in those memory-based approaches.

On the other hand, the proposed approach reduces the number of multiplexers compared to previous delay-based approaches [8], [24] while having the minimum number of delays/memory. It also widens the scope, as some previous approaches [8] restrict to strides.
of EBEs (7 6), (2 1), and (6 2), which require 64, 2, and 60 delays, leading to the expected total of 126 delays.

For the second cycle, there is only one parallel dimension, which we use as pivot dimension. According to Algorithm 3, we obtain

$$0 \rightarrow \underset{0}{5} \rightarrow \underset{0}{3} \rightarrow \underset{0}{4} \rightarrow 0 \tag{53}$$

which leads to the sequence of EBEs (5 0), (3 0), and (4 0). According to (27), this sequence of EBEs requires 32 + 8 + 16 − 8 − 8 = 40 delays, which corresponds to the expected value.

Finally, the obtained EBEs are implemented with the circuits in Section IV, leading to the circuit in Fig. 13(b). Note that as the cycles are independent, the order of the circuits that calculate the cycles can be exchanged.

REFERENCES

[1] D. Fraser, “Array permutation by index-digit permutation,” J. ACM, vol. 23, no. 2, pp. 298–309, Apr. 1976.
[2] H. S. Stone, “Parallel processing with the perfect shuffle,” IEEE Trans. Comput., vol. C-20, no. 2, pp. 153–161, Feb. 1971.
[3] M. Garrido, “Efficient hardware architectures for the computation of the FFT and other related signal processing algorithms in real time,” Ph.D. dissertation, Dept. Signal, Syst. Radiocommun., Univ. Politécnica Madrid, Madrid, Spain, Dec. 2009.
[4] I. D. Lotto and D. Dotti, “Large-matrix-ordering technique with applications to transposition,” Electron. Lett., vol. EL-9, no. 16, pp. 374–375, Aug. 1973.
[5] J. Granata, M. Conner, and R. Tolimieri, “Recursive fast algorithm and the role of the tensor product,” IEEE Trans. Signal Process., vol. 40, no. 12, pp. 2921–2930, Dec. 1992.
[6] T. Järvinen, “Systematic methods for designing stride permutation interconnections,” Ph.D. dissertation, Inst. Digit. Comput. Syst., Tampere Univ. Technol., Tampere, Finland, Nov. 2004.
[7] T. Järvinen, P. Salmela, H. Sorokin, and J. Takala, “Stride permutation networks for array processors,” in Proc. 15th IEEE Int. Conf. Appl.-Specific Syst., Architectures Processors, Sep. 2004, pp. 376–386.
[8] T. Järvinen, P. Salmela, H. Sorokin, and J. Takala, “Stride permutation networks for array processors,” J. VLSI Signal Process. Syst., vol. 49, no. 1, pp. 51–71, Oct. 2007.
[9] M. Garrido, J. Grajal, and O. Gustafsson, “Optimum circuits for bit reversal,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 10, pp. 657–661, Oct. 2011.
[10] C. Cheng and F. Yu, “An optimum architecture for continuous-flow parallel bit reversal,” IEEE Signal Process. Lett., vol. 22, no. 12, pp. 2334–2338, Dec. 2015.
[11] W. Li, F. Yu, and Z. Ma, “Efficient circuit for parallel bit reversal,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 4, pp. 381–385, Apr. 2016.
[12] C.-M. Chen, C.-C. Hung, and Y.-H. Huang, “An energy-efficient partial FFT processor for the OFDMA communication system,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 2, pp. 136–140, Feb. 2010.
[13] S.-G. Chen, S.-J. Huang, M. Garrido, and S.-J. Jou, “Continuous-flow parallel bit-reversal circuit for MDF and MDC FFT architectures,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 10, pp. 2869–2877, Oct. 2014.
[14] R. Chen and V. K. Prasanna, “Optimal circuits for parallel bit reversal,” in Proc. 54th ACM/EDAC/IEEE Des. Automat. Conf. (DAC), Jun. 2017, pp. 1–6.
[15] Y.-N. Chang, “An efficient VLSI architecture for normal I/O order pipeline FFT design,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 12, pp. 1234–1238, Dec. 2008.
[16] M. Garrido, J. Grajal, M. A. Sanchez, and O. Gustafsson, “Pipelined radix-$2^k$ feedforward FFT architectures,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 1, pp. 23–32, Jan. 2013.
[17] M. Garrido, S. J. Huang, and S. G. Chen, “Feedforward FFT hardware architectures based on rotator allocation,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 2, pp. 581–592, Feb. 2018.
[18] Y. Chen, Y. W. Lin, Y. C. Tsao, and C. Y. Lee, “A 2.4-Gsample/s DVFS FFT processor for MIMO OFDM communication systems,” IEEE J. Solid-State Circuits, vol. 43, no. 5, pp. 1260–1273, May 2008.
[19] Y. W. Lin and C. Y. Lee, “Design of an FFT/IFFT processor for MIMO OFDM systems,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 4, pp. 807–815, Apr. 2007.
[20] M. Garrido, S.-J. Huang, S.-G. Chen, and O. Gustafsson, “The serial commutator FFT,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 10, pp. 974–978, Oct. 2016.
[21] M. Garrido, N. K. Unnikrishnan, and K. K. Parhi, “A serial commutator fast Fourier transform architecture for real-valued signals,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 65, no. 11, pp. 1693–1697, Nov. 2018.
[22] D. Akopian, J. Takala, J. Saarinen, and J. Astola, “Multistage interconnection networks for parallel Viterbi decoders,” IEEE Trans. Commun., vol. 51, no. 9, pp. 1536–1545, Sep. 2003.
[23] K. K. Parhi, “Systematic synthesis of DSP data format converters using life-time analysis and forward-backward register allocation,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 7, pp. 423–440, Jul. 1992.
[24] M. Majumdar and K. K. Parhi, “Design of data format converters using two-dimensional register allocation,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 45, no. 4, pp. 504–508, Apr. 1998.
[25] M. Püschel, P. A. Milder, and J. C. Hoe, “Permuting streaming data using RAMs,” J. ACM, vol. 56, no. 2, Apr. 2009, Art. no. 10.
[26] R. Chen and V. K. Prasanna, “Automatic generation of high throughput energy efficient streaming architectures for arbitrary fixed permutations,” in Proc. 25th Int. Conf. Field Program. Logic Appl. (FPL), Sep. 2015, pp. 1–8.
[27] F. Serre, T. Holenstein, and M. Püschel, “Optimal circuits for streamed linear permutations using RAM,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2016, pp. 215–223.
[28] J. H. Takala, T. S. Järvinen, and H. T. Sorokin, “Conflict-free parallel memory access scheme for FFT processors,” in Proc. Int. Symp. Circuits Syst., vol. 4, May 2003, pp. IV-524–IV-527.
[29] J. Takala and T. Järvinen, Stride Permutation Access in Interleaved Memory Systems, (Domain-Specific Processors: Systems, Architectures, Modeling, and Simulation), S. Bhattacharyya, E. Deprettere, and J. Teich, Eds. Boca Raton, FL, USA: CRC Press, 2003.
[30] T. Koehn and P. Athanas, “Arbitrary streaming permutations with minimum memory and latency,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2016, pp. 1–6.
[31] T. E. Koehn, “Automatic generation of efficient parallel streaming structures for hardware implementation,” Ph.D. dissertation, Dept. Elect. Eng., Virginia Polytech. Inst., Blacksburg, VA, USA, Nov. 2016.
[32] A. Edelman, S. Heller, and S. L. Johnsson, “Index transformation algorithms in a linear algebra framework,” IEEE Trans. Parallel Distrib. Syst., vol. 5, no. 12, pp. 1302–1309, Dec. 1994.
[33] M. Garrido, M. Acevedo, A. Ehliar, and O. Gustafsson, “Challenging the limits of FFT performance on FPGAs,” in Proc. Int. Symp. Integr. Circuits (ISIC), Dec. 2014, pp. 172–175.
[34] T. Ahmed, M. Garrido, and O. Gustafsson, “A 512-point 8-parallel pipelined feedforward FFT for WPAN,” in Proc. 45th Asilomar Conf. Signals, Syst. Comput. (ASILOMAR), Nov. 2011, pp. 981–984.

Mario Garrido (M’07) received the M.S. degree in electrical engineering and the Ph.D. degree from the Technical University of Madrid, Madrid, Spain, in 2004 and 2009, respectively.
In 2010, he joined the Department of Electrical Engineering, Linköping University, Linköping, Sweden, as a Postdoctoral Researcher, where he has been an Associate Professor since 2012. His current research interests include optimized hardware design for signal processing applications, particularly the design of hardware architectures for the calculation of transforms, such as the fast Fourier transform, circuits for data management, the CORDIC algorithm, and circuits to calculate statistical and mathematical operations, and high-performance circuits for real-time computation, and designs for small area and low power consumption.
Jesús Grajal (M’17) was born in León, Spain, in 1967. He received the Ingeniero de Telecomunicación and Ph.D. degrees from the Technical University of Madrid, Madrid, Spain, in 1992 and 1998, respectively.
He is currently a Professor at the Signals, Systems, and Radio Communications Department, Technical School of Telecommunication Engineering, Technical University of Madrid. His current research interests include hardware-design for radar systems, radar signal processing and broadband digital receivers for radar, and spectrum surveillance applications.

Oscar Gustafsson (S’98–M’03–SM’10) received the M.Sc., Ph.D., and Docent degrees from Linköping University, Linköping, Sweden, in 1998, 2003, and 2008, respectively.
He is currently an Associate Professor and the Head of the Computer Engineering Division, Department of Electrical Engineering, Linköping University. His current research interests include the design and implementation of DSP algorithms and arithmetic circuits. He has authored and coauthored more than 140 papers in international journals and conferences on these topics.
Dr. Gustafsson is a member of the VLSI Systems and Applications and the Digital Signal Processing Technical Committees of the IEEE Circuits and Systems Society. He has served and serves in various positions for conferences such as ISCAS, PATMOS, PrimeAsia, Asilomar, Norchip, ECCTD, and ICECS. He currently serves as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS and Integration, the VLSI Journal.