Erik Brockmeyer
Target Compiler Technologies
Technologielaan 11-0002, B-3001 Leuven, Belgium
February 2010
Version: 0.1
Abstract
The Discrete Fourier Transform (DFT) and Fast Fourier Transform (FFT) are typical functions in the
implementation of wireless standards such as WiMAX and LTE. These standards support a wide range of
symbol sizes, each requiring a different DFT/FFT size. The DFT in particular can become complex if an
efficient implementation is required. The Prime Factor Algorithm (PFA) can efficiently implement a DFT
whose size can be expressed as a product of prime factors. Unfortunately, the modules of which this
algorithm consists are irregular. A flexible and programmable architecture is therefore key to supporting
all these standards efficiently. A standard processor, however, lacks the compute power to deliver a
sufficiently high throughput. An Application Specific Processor (ASIP) allows the processor to be tuned
for a specific task so that it can meet the throughput requirement [6].
The designed ASIP, named Primecore, is capable of delivering a throughput of 1 input sample per
cycle at a frequency of 500 MHz. This result has been obtained by vectorizing the code and reducing the
loop overhead to a minimum. Special instructions have been added to execute the butterfly patterns
efficiently in a Single Instruction Multiple Data (SIMD) style. Special load and store instructions have
also been constructed to realize a memory bandwidth that matches the requirements of the data path.
Instruction Level Parallelism (ILP) is used to combine the data path and load/store operations into
parallel instructions. The Chess retargetable C compiler supports the programming of this application-
specific architecture and exploits the available ILP.
The total gate count of the Primecore design, obtained for a clock frequency of 500 MHz and a
65 nm technology, is 340 kgates.
Contents
1 Introduction 3
2 Application 3
2.1 DFT/FFT sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Algorithm development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Scalar operation count estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.5 Special operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.6 Reducing load/stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.7 Vector operation count estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.8 Fixed point and data scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.9 Source code structure of modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.10 Guiding compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Architecture 12
3.1 Overall structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Results 15
4.1 Cycle count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Synthesis results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3 Code size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.4 Signal to noise ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5 Evaluation 19
5.1 Native compilation and execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.2 On target compilation and execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.3 Verilog generation and testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6 Future work 21
7 Summary 21
Glossary 22
Bibliography 23
1 Introduction
The Primecore has been developed for use in LTE and 4G base stations. These stations require a high
throughput for algorithms like the Fast Fourier Transform (FFT) and the Discrete Fourier Transform (DFT).
The target throughput is 1 cycle per complex input sample. At the same time, the architecture should be able
to run at a high clock frequency (500 MHz).
The uplink transmission scheme selected for Long Term Evolution (LTE) is SC-FDMA, also known as DFT-
spread Orthogonal Frequency-Division Multiplexing (OFDM). The DFT is one of the most demanding blocks
in the communication scheme. A straightforward implementation would require many parallel multipliers
to meet real-time constraints. It is known that an efficient implementation of a DFT is possible if the size
of the transform can be factorized into a small number of (prime) numbers (the Prime Factor Algorithm
(PFA) [2]). The PFA was already envisioned when the LTE standard was defined. It requires far fewer
resources, but has a much less regular control flow and consists of many different cases. Therefore an
Application Specific Processor (ASIP) looks like a promising solution to implement this type of algorithm.
The remainder of this document is structured as follows. Section 2 discusses the application, its charac-
teristics and the optimizations performed. Section 3 explains the Primecore architecture. Section 4 reveals
the cycle count and gate count results. Some guidance for evaluating the Primecore and IP Designer is given
in Section 5. Open issues and future work are discussed in Section 6. Finally, a short summary can be found
in the last section.
2 Application
2.1 DFT/FFT sizes

The PFA algorithm was envisioned when the LTE standard was defined. The smaller the prime numbers in the
factorization are, the simpler the implementation. To simplify the implementation, it was therefore proposed
to limit the uplink scheduling grants to allocations corresponding to DFT precoding sizes that can be written
as a product of the numbers 2, 3 and 5 [3], where the power of the prime number 5 is at most 2. In LTE, the
DFT size must also be a multiple of 12 (agreed at RAN1#46). In the more recent LTE-Advanced (LTE-A)
standard, this is reduced to a multiple of 6. The maximum DFT size is 1296 carriers. As a result, the DFT
sizes of Table 1 must be supported.
In addition to the DFT sizes, the FFT sizes (powers of 2) in the range 8 to 2048 must be supported.
2.2 Algorithm development

The PFA implementation of the Colorado School of Mines is used as the initial/reference implementation [4].
In this code, the DFT size must be factorable into mutually prime factors taken from the set
{2, 3, 4, 5, 7, 8, 9, 11, 13, 16}. In other words: n = 2^p * 3^q * 5^r * 7^s * 11^t * 13^u, where
0 <= p <= 4, 0 <= q <= 2 and 0 <= r, s, t, u <= 1. Obviously, the larger prime factors are not needed for
LTE, while the maximum powers of the smaller primes are too low. Building modules larger than 16 should
be avoided, as this results in a very large code size when the modules are flattened in the source code.
Like an FFT, the PFA algorithm can be executed in multiple stages. Between two stages a twiddle
multiplication must be performed. The same prime factor cannot be used twice within a stage, but it can
appear in different stages. So we can perform 3 consecutive PFA stages to support the following sizes:
n = 2^p * 3^q * 5^r, where 0 <= p <= 12, 0 <= q <= 6 and 0 <= r <= 2. This is sufficient for the required
sizes of Table 1.
Table 1: Supported DFT sizes, n = 2^k * 3^l * 5^m.

no  size    k  l  m
 1     6    1  1  0
 2    12    2  1  0
 3    18    1  2  0
 4    24    3  1  0
 5    30    1  1  1
 6    36    2  2  0
 7    48    4  1  0
 8    54    1  3  0
 9    60    2  1  1
10    72    3  2  0
11    90    1  2  1
12    96    5  1  0
13   108    2  3  0
14   120    3  1  1
15   144    4  2  0
16   150    1  1  2
17   162    1  4  0
18   180    2  2  1
19   192    6  1  0
20   216    3  3  0
21   240    4  1  1
22   270    1  3  1
23   288    5  2  0
24   300    2  1  2
25   324    2  4  0
26   360    3  2  1
27   384    7  1  0
28   432    4  3  0
29   450    1  2  2
30   480    5  1  1
31   486    1  5  0
32   540    2  3  1
33   576    6  2  0
34   600    3  1  2
35   648    3  4  0
36   720    4  2  1
37   768    8  1  0
38   810    1  4  1
39   864    5  3  0
40   900    2  2  2
41   960    6  1  1
42   972    2  5  0
43  1080    3  3  1
44  1152    7  2  0
45  1200    4  1  2
46  1296    4  4  0
2.3 Scalar operation count estimation

The dimensioning of the Primecore processor is based on operation count statistics. An initial rough
operation count estimate (+, * and ld/st) is made by instrumenting the single-stage scalar code. To give an
idea of the required throughput, we report the operation count statistics for the DFT size 720 in Table 2.
The numbers do not include address computation, call overhead or loop overhead. Also, note that the
operation count increases further if multiple stages are required (the in-between twiddles require loads and
multiplications). The "average scalar operations/cycle" line shows how many parallel operations we have
to provide in the architecture: over 40 parallel operations must be performed per cycle.
From these statistics it is clear that we have to maximize the use of Single Instruction Multiple Data (SIMD)
to reach the desired throughput. The greatest common factor over all DFT sizes is 6. Moreover, most data
is loaded/stored as complex data. Hence, our initial refinement is a vector machine of 6 complex elements
(bringing the number of parallel scalar elements to 12). In the ideal case, an Instruction Level Parallelism
(ILP) of 4 is then needed to achieve the required parallelism of 40 operations per cycle.
2.4 Vectorization
As mentioned before, a DFT can be built up in multiple stages (like an FFT). The common factor 6 in all
DFT sizes allows a 6-element part to be split off. In this way, the whole DFT is decomposed into 6 sub-DFTs
that internally have the same data flow (see the left-hand side of Figure 1). Each sub-word will be in a
different sub-DFT (the red arrows in front of the sub-DFTs form one vector). The second (split-off) part
processes the data across the sub-words. A special data path operation is added to perform the final radix-6
butterfly. The vectorized code becomes very regular and no data repacking is required in the DFT. The
right-hand side of Figure 1 shows the data flow where we use vectorized butterfly operations. All inter-sub-
word dependencies are implemented in the irdx6 data path operation.
The same vectorization technique can be applied to an FFT. However, the common factor is not 6; either a
common factor of 4 or 8 must be used. It must also be taken into account that the largest FFT is 2K.
Therefore the number of sub-words has been expanded to eight. Similar to the DFT, an irdx8 primitive has
been added to vectorize the code.
Instead of building separate irdx6 and irdx8 data paths, we can build a single irdx data path operation that
executes either one, depending on a mode bit. This enables code reuse. The example of Figure 1 executes a
24-point DFT; by switching the mode to irdx8, the same code performs a 32-point FFT (of course, different
twiddle factors are also required).
2.5 Special operations

Instead of processing separate additions and subtractions, it is much more efficient to specialize the
operations. Butterfly operations and special multiply operations will be introduced. These special-purpose
arithmetic operations are called primitive functions in IP Designer.
Most additions/subtractions happen in a butterfly: the same two values are added and subtracted. A
combined primitive is faster, as it performs the two operations in the same cycle. Moreover, it avoids
reading the same values twice from the register file, which reduces the number of register ports.

[Figure 1: Vectorized DFT data flow. Left: the DFT decomposed into six sub-DFTs (sub-dft0 .. sub-dft5)
in two parts (part1, part2). Right: the vectorized data flow built from v6_cbf butterflies and twiddle
multiplications, ending in the irdx6 stages.]

Two types of butterflies can be identified: vcbf0 and vcbf1; the difference is that the second operand of
vcbf1 is rotated 90 degrees. The basic functionality is described as follows (in Section 2.8 we will add
scaling to the primitive):
// complex butterfly
inline void vcbf0(vcmplx_t i0, vcmplx_t i1, vcmplx_t& o0, vcmplx_t& o1)
{
  o0 = vcmplx_t(i0.vreal()+i1.vreal(), i0.vimag()+i1.vimag());
  o1 = vcmplx_t(i0.vreal()-i1.vreal(), i0.vimag()-i1.vimag());
}
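In scalar form (with std::complex standing in for the vector type, purely as an illustrative sketch), such a butterfly is simply a 2-point DFT computing the sum and difference at once:

```cpp
#include <complex>

using cplx = std::complex<double>;

// Scalar model of the vcbf0 primitive: one butterfly is a 2-point DFT,
// producing the sum and the difference of its inputs in a single step.
void cbf0(cplx i0, cplx i1, cplx& o0, cplx& o1) {
    o0 = i0 + i1;  // DFT bin 0
    o1 = i0 - i1;  // DFT bin 1
}
```

For example, cbf0 applied to 1+2i and 3+4i yields 4+6i and -2-2i.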
2.6 Reducing load/stores

The number of parallel load/stores dictates the number of memory ports. Preferably, the number of memory
ports is low, to ease IP integration.

A straightforward implementation reads and writes the data at every module boundary and twiddle boundary.
Consider DFT size 900. It requires three stages: 30, 5 and 6 (the final stage is always 6, and there can be
only one factor 5 per stage). The first stage is decomposed into the modules 2, 3 and 5. As a result, we load
and store the data 14 times (see the 7 loads + 7 stores in Figure 2a). Taking into account that every
load/store is a vector of 6 elements and that we target a throughput of 1 cycle/element, we can derive that
we may load/store the data at most 6 times in total (assuming zero overhead).
The number of load/stores can be reduced by merging functions. An obvious improvement is to merge the
twid_1to2 computations with the irdx stage. Similarly, the twid_0to1 computations can be merged into the
module before or after them. Note that this increases the code size, as all modules must be implemented
both with and without twiddle computation (adding a condition in the inner loop is not an option, as this
would result in a loss of performance). Similarly, we can consider merging the twid_1to2+irdx and a
module (see Figure 2b).
[Figure 2: Load/store behaviour of a 900-point DFT. (a) Separate passes module2, module3, module5,
twid_0to1, module5, twid_1to2 and irdx6, each delimited by a ld/st pair. (b) The twiddles merged into the
neighbouring module/irdx6 passes. (c) Smart merging of modules: module2+module5 and
module3+module5.]
Figure 2: a) Every element is loaded and stored 14 times in a straightforward implementation of a DFT of
size 900. b) Merging the twiddles and irdx reduces the load/stores to 8 times. c) Smart merging of modules
reduces the load/stores to 4 times.
A further reduction in load/stores can be achieved by merging modules. Here, for instance, module2 can
be merged into stage0, while module3 is moved to (and merged into) stage1. The load/stores are avoided
by keeping the data in a local register file. Care should be taken that the modules do not get too big;
preferably we keep them under 16 so that we can limit the register file size. Moreover, smart merging avoids
having to build all combinations of sizes with and without irdx/twiddle. A last degree of freedom that can
be exploited to optimize the combinations is the size of the final stage (irdx6 or irdx8).
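The counts quoted for Figure 2 follow from the number of fused passes over the data, since each pass reads and writes every element once. A toy C++ model (the groupings paraphrase the figure; the helper is purely illustrative):

```cpp
#include <string>
#include <vector>

using Group = std::vector<std::string>;  // modules fused into one pass

// Every fused group loads each element once and stores it once, so the
// number of memory accesses per element is twice the number of groups.
int accesses_per_element(const std::vector<Group>& groups) {
    return 2 * (int)groups.size();
}

// DFT size 900 (stages 30, 5 and 6), as in Figure 2:
const std::vector<Group> naive = {
    {"module2"}, {"module3"}, {"module5"}, {"twid_0to1"},
    {"module5"}, {"twid_1to2"}, {"irdx6"}};
const std::vector<Group> twiddles_merged = {
    {"module2"}, {"module3"}, {"module5", "twid_0to1"},
    {"module5", "twid_1to2", "irdx6"}};
const std::vector<Group> smart_merged = {
    {"module2", "module5", "twid_0to1"},
    {"module3", "module5", "twid_1to2", "irdx6"}};
```

The three partitions give 14, 8 and 4 accesses per element, matching panels a), b) and c) of the figure.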
The left-hand side of Table 3 shows the result of the module merging optimization. Note that stage2 is the
irdx primitive and it is merged together with stage1. As a result, all data is read/written at most 4 times for
any DFT size. Note, however, that the module code differs depending on whether it is used in stage0 or in
stage1, though further code replication is avoided by the irdx6/irdx8 mode bit. It can also be observed that
the code size is limited by having only the modules 5, 8, 9 and 10 in stage0. This table has also been coded
in the application source code, as a lookup table to select the correct functions.

The FFT can be implemented with the same structure. A module16 has been added to stage0 to implement
all needed FFT sizes.
2.7 Vector operation count estimation

Refined operation count statistics were measured after the source code optimizations of the previous sub-
sections. Based on these new statistics, it is feasible to build an architecture that meets the real-time
constraint. The statistics are given in the middle part of Table 3.
We have measured the average number of load/stores, multiply and butterfly operations. The irdx operations
and the loading of coefficients are not measured, as they are always lower in number than the stores (and
hence a less critical resource). Based on the average number of operations per cycle, we could conclude
that one unit per type would be sufficient. However, we should beware of the following pitfalls:
• Loop overhead. The modules contain a loop that iterates over the data. Software pipelining (or loop
folding) will be applied by the compiler to exploit the ILP. However, the additional cycles to set up
the loop can be significant for low iteration counts. In our case we have low iteration counts, ranging
from 1 to 18. The loop overhead can be reduced by allowing delay slot instructions after the loop
setup instruction.
Table 3: Left: DFT/FFT stage sizes. Middle: Statistics for ld/st, mult and butterflies in optimized code.
Right: cycle measurements on primecore.
• Module switching overhead. A switch statement can have a significant overhead (easily tens of
cycles), which matters for the smaller DFTs. Simply inlining all modules is not a solution, as the
code size would explode. Therefore we propose to use function pointers, which allow fast switching
to any module function. For the smaller DFT sizes we can still specialize the code to reduce the
overhead further.
• Loading module coefficients. Some of the modules require coefficients, which need to be generated.
Generating constants from immediate values in the instruction word is undesirable, as it increases the
length of the instruction word. The better alternative is to load coefficients from a memory. In that
case the overhead of loading these coefficients should be reduced as much as possible. This can be
done by packing multiple coefficients in a vector.
• Limited scheduling freedom. Even with sufficient functional units, it can very well be that the
operations cannot be scheduled efficiently due to data dependencies. Imbalance between the loops can
also cause a suboptimal schedule (for instance, the first module has many load/stores, while the
second module needs many butterflies). The scheduling freedom can be improved by introducing
indexed load/stores (for data and coefficients). All module accesses follow the pattern p[# * n]
(where # is an immediate and n a variable). The compiler can then reorder the indexed accesses to
match the schedule. An architecture matching this pattern is best placed to exploit the scheduling
freedom.
The initial processor architecture we have in mind has 5 parallel slots:
1. One vload/vstore unit for data.
2. One vload unit for coefficients.
3. One multiplier unit for vector-mul, complex-vector-mul and vsummul.
4. One butterfly unit.
5. One irdx unit.
More accurate results can only be obtained by working out the architecture and compiling the code.
2.8 Fixed point and data scaling

The vectorization and merging optimizations have been performed on double-precision code. Checking the
correctness of the source code in double precision is easier, as the error in the output signal should be very
small. Quantization errors are introduced when making the code fixed point. Some effort has been spent on
reducing the quantization error by adding scale and round functions to the butterfly and irdx primitives. The
scaling results in a higher gate count but yields a better Signal-to-Noise Ratio (SNR). The current primitive
allows independent scaling of the inputs, while the outputs are scaled together. To be concrete, the scaling
version of vcbf0 is as follows:
base_t sh2mul[] = { 1.0, 0.5, 0.25, 0.125 };

void vcbf0(vcmplx_t i0, vcmplx_t i1, vcmplx_t& o0, vcmplx_t& o1,
           /* scaling factors: */ int si0, int si1, int so)
{
  vcmplx_t t0 = i0*sh2mul[si0];
  vcmplx_t t2 = i1*sh2mul[si1];
  o0 = (t0+i1)*sh2mul[so];
  o1 = (t0-t2)*sh2mul[so];
}
Of course, the scaling is implemented as a shift in hardware. In fact, we have tried multiple rounding
schemes. Especially the output scaling is sensitive to rounding error accumulation [1]. Therefore we use a
round-to-even function for the output scaling and a round-down function for the input scaling.
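The two rounding modes can be modeled in plain C++ (a stand-alone illustrative sketch, not the PDG primitives):

```cpp
// Round-down scaling: plain arithmetic right shift (floor division by 2^s).
int shift_round_down(int v, int s) { return v >> s; }

// Round-half-to-even scaling: round up when the remainder exceeds half an
// LSB; on an exact tie, round toward the even result so that ties do not
// bias the accumulated error.
int shift_round_even(int v, int s) {
    if (s == 0) return v;
    int q = v >> s;                  // floor quotient
    int r = v & ((1 << s) - 1);      // remainder in [0, 2^s)
    int half = 1 << (s - 1);
    if (r > half || (r == half && (q & 1))) ++q;
    return q;
}
```

For example, 5/2 rounds to 2 while 7/2 rounds to 4 under round-to-even, so ties land on even values instead of drifting systematically upward.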
An appropriate scaling/rounding scheme can save quite a few hardware bits and hence make the architecture
cheaper. An ASIP allows both the rounding instructions and the bit widths to be designed flexibly to arrive
at the optimal solution.
While developing the code it is useful to keep the double-precision version as a reference. The appli-
cation code has been set up in such a way that native compilation can be done both in double precision and
in fractional mode.

2.9 Source code structure of modules
y3 = vsummul(t3,t4,cm1);
y4 = vsummul(t3,t4,cm2);
vcbf1(y1,y4, o4,o1, /* scaling: */ 0,0);
vcbf1(y2,y3, o3,o2, /* scaling: */ 0,0);
}
i0 = pIn[0*stepn];
i1 = pIn[1*stepn];
i2 = pIn[2*stepn];
i3 = pIn[3*stepn];
i4 = pIn[4*stepn];
pIn++;
vmodule5io(i0,i1,i2,i3,i4, t0,t1,t2,t3,t4,
select_tc(c,0),cselect_tc(c,2),cselect_tc(c,0));
t0 = t0 * pTwid[0*stepn]; o0 = irdx(t0,mrRS);
t1 = t1 * pTwid[1*stepn]; o1 = irdx(t1,mrRS);
t2 = t2 * pTwid[2*stepn]; o2 = irdx(t2,mrRS);
t3 = t3 * pTwid[3*stepn]; o3 = irdx(t3,mrRS);
t4 = t4 * pTwid[4*stepn]; o4 = irdx(t4,mrRS);
pTwid++;
2.10 Guiding compiler

By guiding the compiler with pragmas, the cycle budget can be reduced. The pragmas were tuned after
building the architecture. We have done the following:
• Stating that a loop has at least 1 iteration, with the pragma chess_loop_range(1,). This avoids the
compiler inserting a condition to check for at least 1 iteration.
• Stating where a function parameter is stored, with the pragma int chess_storage(m0) n. This avoids a
move from an argument register to the M register.
• The compiler was suboptimal in assigning the input data to registers. This is more difficult because
the set of registers that can be loaded with data is limited. Therefore we have assigned them manually
for the problematic modules, as follows: vcmplx_t CHESS_STORAGE_MOD9(V0) i0;. We are investigating
how to improve the compiler on this issue.
• Restrict pointers to give scheduling freedom to the compiler (also for the individual writes).
3 Architecture
3.1 Overall structure

The Primecore architecture has been implemented to match the data flow of the modules (see Section 2.9).
A diagram of the vector-related part is given in Figure 3. There are two memories: DM for the data and
CM for the coefficients. The main register file is V[24]; it can store all the temporary data for a module
without having to spill that data to memory. An effort is made to limit the number of ports on this large
register file and thus to keep the gate count low. The two read ports to the vec0 unit can only read from
half the register file. Likewise, the loads from memory and the results from the multiplier can only write
to a limited range (shown as the green and purple parts). The X[6] register file contains the data that goes
to the multiplier. The (twiddle) constants for the multiplier are stored in T. The TC register is meant for
the longer-term constants, while TW is meant for the short-lifetime twiddle factors. The vsel0 unit selects
the required constant. The multiplied results can be written back either to the V register file or to the RDXI
register, to be processed in the final stage. The vec1 unit performs the final irdx function and its output is
written back to memory via the RDXO register. If the module is in the first stage, the result can be stored
directly from the RDXI register.
[Figure 3: Primecore vector data path. Memories DM and CM; register files V[24], VA[12], VB[12], X[6],
T[2] (TC and TW), RDXI and RDXO; functional units vsel0, vec0 (butterfly), vec1 (irdx) and vmpy0
(vcmul, vmul, vsummul).]
Remarks:
• The processor has 5 parallel ILP issue slots. The three functional units operate in parallel. Moreover,
the two memories can load/store data in parallel.
• To meet the required frequency of 500 MHz, we have used a pipelined multiplier.
3.2 Primitives
Coding the primitives is done in a C-like language with the PDG tool. For instance, the butterfly primitive
can be coded as follows.
void vcbfly0(vcword vi0, vcword vi1, vcword& vo0, vcword& vo1,
             uint2 f0, uint2 f1, uint2 f2)
{
  vcword r0, r1;
  vcword t0 = vcscale(vi0, f0);
  vcword t1 = vcscale(vi1, f1);
  for (int32_t i = 0; i < VSIZE; i++) {
    r0[i] = cmplx(ext_re(t0[i])+ext_re(vi1[i]), ext_im(t0[i])+ext_im(vi1[i]));
    r1[i] = cmplx(ext_re(t0[i])-ext_re(t1[i]), ext_im(t0[i])-ext_im(t1[i]));
  }
  vo0 = vcscale(r0, f2);
  vo1 = vcscale(r1, f2);
}
The only primitive that requires a more detailed explanation is the irdx primitive. Special care is taken to
avoid duplication of resources and a long critical path. The irdx6 and irdx8 variants are written down in one
function to enforce resource sharing. Also, the fixed multiplications in these primitives are written as sums
of shifts to avoid the instantiation of a multiplier. Even the summing has been written so as to share the
adders (in mul0_irdx). The whole primitive is computed with 18-bit precision, and the result is rounded at
the end. In this way the irdx primitive has become a small functional unit that meets the timing constraint
of a 500 MHz clock.
vcword irdx(vcword a,uint1 rdx8)
{
cwordP2 dummy_P2 = (cwordP2)0;
cwordP2 i0 = cnvt_cwordP2(a[0]);
cwordP2 i1 = cnvt_cwordP2(a[1]);
cwordP2 i2 = cnvt_cwordP2(a[2]);
cwordP2 i3 = cnvt_cwordP2(a[3]);
cwordP2 i4 = cnvt_cwordP2(a[4]);
cwordP2 i5 = cnvt_cwordP2(a[5]);
cwordP2 i6 = cnvt_cwordP2(a[6]);
cwordP2 i7 = cnvt_cwordP2(a[7]);
uint1_t rdx6=1-rdx8;
cwordP2 t1A,t2A,t3,t4,t5A,t6A,t7A,t8A,t9A,t10A,t11A,t12A;
cwordP2 y1A,y2A,y3A,y4A,y5A,y6A,y7A;
cwordP2 o0A,o1A,o2A,o3A,o4A,o5A,o6A,o7A;
cbf0_P22( i1, i5, t3, t4, /* scaling: */ 0);
cbf0_P22( i2, rdx8?i6:i4, t5A, t6A, /* scaling: */ 0);
cbf0_P22( i0, rdx8?i4:t5A, t1A, t2A, /* scaling: */ rdx6);
cbf0_P22( i3, rdx8?i7:t3, t7A, t8A, /* scaling: */ rdx6);
cbf0_P22( t1A,rdx8?t5A:t7A, t9A, y2A, /* scaling: */ 0);
cwordP2 m1,m2;
m1 = mul0_irdx(rdx8,t11A,t6A);
m2 = mul1_irdx(rdx8,t12A,t4);
t6A = cmplxP2(ext_im_P2(t6A),-ext_re_P2(t6A));
cbf0_P22( t2A,m1, y1A,y3A,/* scaling: */ 0);
cbf0_P22(rdx8?m2:t8A, rdx8?t6A:m2, y7A,y5A,/* scaling: */ 0);
vcword r;
r[0] = cscale_P2(rdx8?o0A:t9A,2);
r[1] = cscale_P2(rdx8?o1A:o3A,2);
r[2] = cscale_P2(rdx8?o2A:o7A,2);
r[3] = cscale_P2(rdx8?o3A:y2A,2);
r[4] = cscale_P2(rdx8?o4A:o5A,2);
r[5] = cscale_P2(rdx8?o5A:o1A,2);
r[6] = cscale_P2(rdx8?o6A:dummy_P2,2);
r[7] = cscale_P2(rdx8?o7A:dummy_P2,2);
return r;
}
wordP2 mul0i_irdx(uint1_t rdx8, wordP2 a, wordP2 b)
{
wordP2 ina = a;
wordP2 inb = b;
wordP2 inbr = ~inb;
wordP2 part0 = rdx8?(ina>>1) :(inbr);
wordP2 part1 = rdx8?(ina>>3) :(inb>>3);
wordP2 part2 = rdx8?(ina>>4) :(inb>>7);
wordP2 part3 = rdx8?(ina>>6) :(inb>>10);
wordP2 part4 = rdx8?(ina>>8) :(inb>>12);
wordP2 part5 = rdx8?(ina>>14):1;
wordP2 r1 = (part0+part1+part2+part3+part4+part5);
return r1;
}
// similar to mul0i_irdx are:
wordP2 mul0r_irdx(uint1_t rdx8, wordP2 a, wordP2 b)
wordP2 mul1i_irdx(uint1_t rdx8, wordP2 a, wordP2 b)
wordP2 mul1r_irdx(uint1_t rdx8, wordP2 a, wordP2 b)
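The shift patterns above appear to encode the fixed radix constants as sums of shifts: the rdx8 path sums to roughly 1/√2 (the W8 twiddle) and the rdx6 path (negated input plus shifts) to roughly √3/2 (the sin 60° factor of the radix-6 butterfly). This reading is our own inference from the code; a quick numeric check in C++:

```cpp
#include <cmath>

// Weight of the rdx8 shift pattern in mul0i_irdx:
// a>>1 + a>>3 + a>>4 + a>>6 + a>>8 + a>>14.
double shift_sum_rdx8() {
    return std::pow(2, -1) + std::pow(2, -3) + std::pow(2, -4)
         + std::pow(2, -6) + std::pow(2, -8) + std::pow(2, -14);
}

// Magnitude of the rdx6 path: ~b + b>>3 + b>>7 + b>>10 + b>>12 + 1
// equals -b * (1 - 2^-3 - 2^-7 - 2^-10 - 2^-12).
double shift_sum_rdx6() {
    return 1.0 - std::pow(2, -3) - std::pow(2, -7)
               - std::pow(2, -10) - std::pow(2, -12);
}
```

Both sums land within about 1e-4 of the exact constants, which is consistent with the 18-bit internal precision of the primitive.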
4 Results

This section first discusses the cycle counts, followed by the synthesis results, the code size and the
signal-to-noise ratio.
4.1 Cycle count

The cycle count requirements are met by designing an architecture with sufficient parallelism. The compiler
then has to find a schedule that uses all resources efficiently. We determined a lower bound on the cycle
count for the Primecore architecture and will show that the compiler comes very close to this bound.

The cycle counts for the DFT/FFT sizes are given in the right-hand side of Table 3. The cycle budget is met
for all sizes; for some of them we even have up to 50% slack remaining. This is because the architecture
is designed for the worst case, and some modules are simpler to implement than others. The main source of
slack is that the SIMD-8 hardware can be used for some DFT sizes.
The cycle count is largely determined by the scheduling of loops. The loop folding transformation
schedules multiple iterations of a loop together to exploit the available ILP. The Initiation Interval (II) is
the most important figure of merit: it is the number of cycles required for one iteration. The folding
requires the creation of a pre/postamble around the loop, and the cycles spent in the pre/postamble can be
significant when the iteration count is low.
Figure 4 shows the scheduled result of vmod10. The red arrow at the left shows the part that is repeated
by the loop (from 429 to 410). The II of the loop is 20 cycles. This is the minimum for this loop on
this architecture, as 20 loads and stores must happen (see the red box). The loop is issued by instruction
407. Instructions 408 and 409 are so-called delay slots. These delay slots can issue useful preamble
instructions, but are not yet part of the loop. Delay slots are required because it takes a few cycles to
initialize the registers that control the zero-overhead loop.
Most modules have a minimum cycle count dictated by the load/store unit: every element must be loaded
once and stored once, so module10 requires 20 load/stores. Only the large modules are limited by the
number of butterflies; module16 and module18 require 34 and 46 butterfly operations respectively.
Table 4 shows how close we are to the minimum II for most modules. By inspecting the schedules, one
can also observe that only a minimum of cycles is lost in the preamble.

The most important conclusion we can draw is that the available hardware resources are used efficiently.
Figure 4: The scheduled vmod10 reaches the minimum II of 20. The critical resource are the 20 data
memory load/stores.
Table 4: Minimum and actual II of loops of the modules. Nearly always the minimum is reached.
[Figure 5: Gate count distribution: vmpy 36%, reg_V 29%, vec0 (bf) 10%, reg_X 8%, vec1 (irdx) 5%,
other vec_regs 3%, dec+ctrl 2%.]
4.2 Synthesis results

An RTL implementation has been generated of the architecture presented in Section 3. Our default synthe-
sis script has been applied to synthesize the core. Tools and libraries used:
Synopsys: synC-2009.06-SP4
Lib: tsmc065gp
Since we do not have a full-option synthesis tool, the results are suboptimal. In particular, we need a
pipelined multiplier to meet the target frequency of 500 MHz, but our DesignWare library does not contain
one. In our synthesis experiments we have inserted our own pipelined multiplier, as our synthesis tools do
not have retiming enabled. However, it is best to use the retiming option of Synopsys to obtain an optimal
pipelined multiplier. The numbers in this section should therefore be interpreted as rough indications,
which can be improved with better synthesis tools and libraries.
The architecture is sliced to ease synthesis. Slicing allows the synthesis tool to process the architecture
per SIMD element rather than synthesizing the whole at once. The main advantage is a faster synthesis
run.
The design synthesizes into 320 kGates at 500 MHz (other frequencies can be found in Table ??). The gate
count distribution is shown in Figure 5. The two dominant contributors are the pipelined vector (complex)
multiplier and the register file V. The vector multiplier contains 4×8 scalar multipliers of 16×16→32 bits.
These will reduce in size when using a better synthesis tool and a better library. The register file V cannot
be reduced in size without affecting the cycle budget, as it must be able to hold the data for a complete
module.
4.3 Code size

The program memory size is (obviously) highly dependent on the whole program. Counting the testbench
program that loads/stores data via hosted IO is not interesting. Table 6 reports 3 components: modules,
control and initialization code. The code size is the number of words of the compiled code. The modules
form the bulk of the code. A lot of care was taken to minimize code duplication while maximizing
performance.

Table 7: SNR for the DFT/FFT algorithm (min/average/max over all sizes). Real and imaginary parts
measured separately.
Currently, the instruction encoding is generated automatically. The instruction word is 91 bits. A rough
estimate is that it can be reduced to 64 bits by manually optimizing the instruction encoding. The non-
parallel instructions can even be reduced further, to 32 bits or less. A combined long/short instruction set
can achieve such a code size reduction; the IP Designer tool suite supports such combined instruction sets.

A projection of the overall code size is given in the last column of Table 6. It can be expected that the
control and initialization code will grow when the application is embedded in its context. Here too, the
combined instruction set limits the overhead.
The signal to noise ratio (SNR) is an important quality measure. The bit width of the processor contributes
to the quality of the signal, but the chosen algorithm and rounding are at least as important. Finally, the
signal itself matters: a small signal will never obtain a good signal to noise ratio. We have done some
limited measurements to get a first impression; a more thorough measurement campaign is still required.
The generated test input signal is a sum of sine signals (see the function generate input; an example
signal is shown in Figure 6). This results in a number of peaks in the frequency domain. We use a
double-precision DFT as the reference signal (labeled “ddft”). The signal degradation differs for the
various DFT sizes; therefore we report a minimum, an average and a maximum. We compute the SNR
for the following cases:
• Quantized DFT signal (labeled “qdft”). Some error is already introduced by quantizing the output signal.
• Fractional straightforward DFT (labeled “dft”). This DFT is computed with 16-bit internal precision.
The summation is a binary tree that scales the sum by 1 bit at every level. This is the fairest comparison
algorithm: scaling before summation results in a poor SNR, while summing in 32-bit precision and
scaling afterwards uses a much higher precision.
• PFA-DFT (labeled “pfa”).
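The binary-tree summation used for the “dft” case can be sketched as follows (a simplified illustration, not the actual application code; `n` is assumed to be a power of two and the shift is assumed to be arithmetic):

```c
#include <stdint.h>

/* Sum n (a power of two) 16-bit values pairwise, in place. Each tree
 * level halves the number of terms and scales the pair sums by one
 * bit, so intermediates never leave the 16-bit range. The result is
 * therefore the true sum divided by n. */
static int16_t tree_sum16(int16_t *v, int n)
{
    for (int len = n; len > 1; len /= 2)
        for (int i = 0; i < len / 2; i++)
            v[i] = (int16_t)(((int32_t)v[2 * i] + v[2 * i + 1]) >> 1);
    return v[0];
}
```

Scaling one bit per level loses at most one bit of precision per tree level, which is why this variant sits between scale-before-summation (poor SNR) and full 32-bit accumulation (much higher precision).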
Table 7 reports the real and imaginary SNR separately for the cases mentioned above. The quantized DFT
has an average SNR of about 80 dB. About 6 dB is lost when the internal precision is reduced to 16 bits.
Our PFA-based computation loses only 3 dB extra. More detailed information is reported when running
the native code (see Section 5.1).
5 Evaluation
This section gives some guidance on how to evaluate the Primecore. Through a tutorial-like description,
we will guide you through the various tasks: compiling and running the code natively, compiling and
running the code on the target processor, building an Instruction Set Simulator (ISS), and generating and
synthesizing Verilog. The processor description can be found in the primecore directory, while the
application code can be found in the vpfa directory. The scripts have been tested under SUSE Enterprise
Linux 11 with gcc 4.3.2. It is assumed that the IP Designer tool suite is installed and a license is
available [7].
During development we have used different data types. The initial code was developed in double
precision. Then a fixed-point fractional type was used, which was subsequently refined to the 16-bit
bit-true type of the Primecore. Since the current application code is probably not the end point, we have
kept the compilation paths for the double-precision and fractional types.
The native testbench generates an input signal by summing a number of sine and cosine signals of
different frequency and amplitude (real and imaginary). This should result in peaks at the corresponding
frequency locations.
The application code can be compiled in double precision, executed and tested with the test_double.sh
script. It executes a straightforward DFT and a PFA-DFT, compares the results and sums the total error.
The final output should report that there are no differences:
TOTAL error_pfa_dft =(0.000000,0.000000)
TOTAL error_pfa_ddft=(0.000000,0.000000)
TOTAL error_dft_ddft=(0.000000,0.000000)
Similarly, the fractional type can be used with the test_fract.sh script. This application computes the DFT
in double precision as a reference (named “ddft”). It also computes a straightforward DFT using the
fractional type (named “dft”). The quantization during the fractional multiplications shows the minimal
signal distortion caused by quantizing. Finally, the PFA-DFT is executed. The error among these 3
functions is determined and printed. As can be expected, the error increases when rounding and scaling
are performed. The signals and errors can be dumped to files for one DFT size of your choice for further
inspection (by setting #define DUMP_DFT_SIZE 162). The dumped files can be visualized with jgraph [5]
or another application. Figure 6 gives an idea of the jgraph output. The jgraph command is as follows:
jgraph test.jgr > out.eps ; ghostview out.eps
The fractional type has 32-bit internal precision, so wrapping or clipping should not cause problems
there. However, they will be a problem when using the 16-bit type of the processor. The rounding in the
application source should prevent such overflows. The fractional type tracks the minimum and maximum
values used. Any overflow problems should be solved before switching to the bit-true type.
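The min/max tracking mentioned above can be realized with a small wrapper around each produced value; a simplified sketch (the actual fractional type in the code base is richer; all names here are illustrative):

```c
#include <float.h>

/* Track the dynamic range of fractional values as they are produced.
 * If min/max stay inside [-1, 1), the 16-bit bit-true type will not
 * overflow. */
typedef struct { double min, max; } frac_range_t;

static void range_init(frac_range_t *r)
{
    r->min = DBL_MAX;
    r->max = -DBL_MAX;
}

/* Record v and pass it through unchanged. */
static double range_track(frac_range_t *r, double v)
{
    if (v < r->min) r->min = v;
    if (v > r->max) r->max = v;
    return v;
}

/* Nonzero if any tracked value left the Q15 fractional range. */
static int range_overflows(const frac_range_t *r)
{
    return r->min < -1.0 || r->max >= 1.0;
}
```

Running the fractional build with such tracking pinpoints which intermediate values need extra scaling before the 16-bit bit-true type can be used safely.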
The final native simulation uses the primitives of the processor and will therefore compute the bit-true
output of the application running on the Primecore (execute test_bittrue.sh). The output should be
equal to that of the previous simulation if no overflow has occurred. Note that you have to build the ISS
primitives first, before you can run this script (so run the update_proc.sh script first).
Generating an input signal is difficult and time consuming on the Primecore (sin/cos computation is not
trivial). Therefore we use the native testbench to generate fixed-point input data and reference data for all
sizes. Hosted IO is used on the ISS to load the input, coefficients and reference data and save the output
data.
[Figure 6: example test input signal, the double-precision DFT reference (“ddft”), the quantized DFT
(“qdft”), and the quantization noise (difference between “ddft” and “qdft”); real and imaginary parts
shown separately.]
An ISS must be built before starting a simulation. The following script updates the ISS and all runtime
libraries for the current processor model.
cd primecore/lib; ./update_proc.sh
The measure.sh script can be used to compile the application code and run it on the ISS. The script also
verifies that the produced output data corresponds to the natively computed output. Finally, it reports the
timing measurements.
As noted above, generating an input signal is difficult and time consuming on the Primecore. Moreover,
hosted IO is not supported for RTL simulation. Therefore a separate testbench has been made that
compiles the input data and reference data into the executable. The application code is first run on the
ISS, logging all register changes. In a second step, the same code is run on the simulated RTL, which
also creates a Register Change Dump (RCD). The two RCDs should be equal. The following script
executes the whole process:
cd primecore/hdl ; ./test_hdl.sh
6 Future work
The Primecore has the basic functionality to perform the DFT/FFT algorithms. However, some further
tuning of the core is required. In particular, the context in the system has to be refined. Possible next
steps are:
• Synchronization with the rest of the system (FIFO communication?).
• Coefficient generation, which is currently done offline.
• The rounding, which requires further investigation.
• Optimising the instruction encoding to reduce the instruction word (see Section 4.3).
7 Summary
The Primecore shows the potential of the IP Designer tool suite in the context of LTE base stations.
Processors can be used to implement the FFT and DFT algorithms efficiently. Convincing evidence is
given that a tuned architecture can be built that meets the tight cycle budget at high clock frequency
constraints. A 6/8-way SIMD machine has been built. Special instructions have been added to efficiently
process butterflies with scaling, as well as dedicated instructions for performing a radix-6 and radix-8
among the SIMD elements. For all symbol sizes we achieve a performance better than 1 cycle per
complex input sample; for some sizes there is even considerable slack. This slack can be used to achieve
an even higher throughput, or to power down the processor each time a frame has been processed. The
pipelined architecture can run at over 500 MHz. The total gate count is around 340 kgates, with the 8
complex multipliers as the main contributor. It can be expected that lower gate counts can be achieved
with better synthesis tools.
Glossary
II : Initiation Interval
The number of cycles between the starts of successive loop iterations.
ILP : Instruction Level Parallelism
ISS : Instruction Set Simulator
References