Erik Brockmeyer
Target Compiler Technologies
Technologielaan 11-0002, B-3001 Leuven, Belgium
February 2010
Version: 0.1
Abstract
The Discrete Fourier Transform (DFT) and Fast Fourier Transform (FFT) are typical functions in the
implementation of wireless standards such as WiMAX and LTE. These standards support a wide range of
symbol sizes, each requiring a different DFT/FFT size. The DFT in particular can become complex if an
efficient implementation is required. The Prime Factor Algorithm (PFA) can efficiently implement a DFT
whose size can be expressed as a product of prime factors. Unfortunately, the modules of which this
algorithm consists are irregular. A flexible and programmable architecture is therefore key to supporting
all these standards efficiently. A standard processor, however, lacks the compute power to deliver a
sufficiently high throughput. An Application Specific Processor (ASIP) allows the processor to be tuned
for a specific task so that it can meet the throughput requirement [6].
The designed ASIP, named Primecore, is capable of delivering a throughput of 1 input sample per
cycle at a frequency of 500 MHz. This result has been obtained by vectorizing the code and reducing the
loop overhead to a minimum. Special instructions have been added to execute the butterfly patterns
efficiently in a Single Instruction Multiple Data (SIMD) style. Special load and store instructions have
also been constructed to realize a memory bandwidth that matches the requirements of the data path.
Instruction Level Parallelism (ILP) is used to combine the data path and load/store operations into
parallel instructions. The Chess retargetable C compiler supports the programming of this application-
specific architecture and exploits the available ILP.
The total gate count of the Primecore design, obtained for a clock frequency of 500 MHz and a
65 nm technology, is 340 kgates.
Contents
1 Introduction 3
2 Application 3
2.1 DFT/FFT sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Algorithm development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Scalar operation count estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.5 Special operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.6 Reducing load/stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.7 Vector operation count estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.8 Fixed point and data scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.9 Source code structure of modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.10 Guiding compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Architecture 12
3.1 Overall structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Results 15
4.1 Cycle count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Synthesis results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3 Code size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.4 Signal to noise ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5 Evaluation 19
5.1 Native compilation and execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.2 On target compilation and execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.3 Verilog generation and testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6 Future work 21
7 Summary 21
Glossary 22
Bibliography 23
1 Introduction
The Primecore has been developed for use in LTE and 4G base stations. These stations require a high
throughput for algorithms like the Fast Fourier Transform (FFT) and the Discrete Fourier Transform (DFT).
The target throughput is 1 cycle per complex input sample. At the same time, the architecture should be able
to run at a high clock frequency (500 MHz).
The uplink transmission scheme selected for Long Term Evolution (LTE) is SC-FDMA, also known as DFT-
spread Orthogonal Frequency-Division Multiplexing (OFDM). The DFT is one of the most demanding blocks
in the communication scheme. A straightforward implementation would require many parallel multipliers
to meet real-time constraints. It is known that an efficient implementation of a DFT is possible if the size
of the transform can be factorized into a small number of (prime) numbers (the Prime Factor Algorithm
(PFA) [2]). The PFA was already envisioned when the LTE standard was defined. It requires far fewer
resources, but has a much less regular control flow and consists of many different cases. Therefore an
Application Specific Processor (ASIP) looks like a promising solution to implement this type of algorithm.
The remainder of this document is structured as follows. Section 2 discusses the application, its charac-
teristics and the optimizations performed. Section 3 explains the Primecore architecture. Section 4 reveals
the cycle count and gate count results. Some guidance for evaluating the Primecore and IP Designer is given
in Section 5. Open issues and future work are discussed in Section 6. Finally, a short summary can be found
in the last section.
2 Application
2.1 DFT/FFT sizes

The PFA algorithm was envisioned when the LTE standard was defined. The smaller the prime numbers in the
factorization are, the simpler the implementation. To simplify the implementation, it was therefore proposed
to limit the uplink scheduling grants to allocations corresponding to DFT precoding sizes that can be written
as a product of the numbers 2, 3 and 5 [3], where the power of the prime number 5 is at most 2. In LTE, the
DFT size must also be a multiple of 12 (agreed at RAN1#46). In the more recent LTE-Advanced (LTE-A)
standard, this is reduced to a multiple of 6. The maximum DFT size is 1296 carriers. As a result, the DFT
sizes of Table 1 must be supported.
In addition to the DFT sizes, the FFT sizes (powers of 2) in the range 8 to 2048 must be supported.
2.2 Algorithm development

The PFA implementation of the Colorado School of Mines is used as the initial/reference implementation [4].
In this code, the DFT size must be factorable into mutually prime factors taken from the set
{2, 3, 4, 5, 7, 8, 9, 11, 13, 16}. In other words: n = 2^p * 3^q * 5^r * 7^s * 11^t * 13^u, where
0 <= p <= 4, 0 <= q <= 2 and 0 <= r, s, t, u <= 1. Obviously, the larger prime factors are not needed for
LTE, while the maximum powers of the smaller primes are too low. Building modules larger than 16 should
be avoided, as this results in a very large code size when the modules are flattened in the source code.
Like an FFT, the PFA algorithm can be executed in multiple stages. Between two stages a twiddle
multiplication must be performed. The same prime factor cannot be used twice within a stage, but it can
appear in different stages. So we can perform 3 consecutive PFA stages to support the following sizes:
n = 2^p * 3^q * 5^r, where 0 <= p <= 12, 0 <= q <= 6 and 0 <= r <= 2. This is sufficient for the required
sizes of Table 1.
Table 1: Supported DFT sizes, n = 2^k * 3^l * 5^m.

no  size    k  l  m
 1     6    1  1  0
 2    12    2  1  0
 3    18    1  2  0
 4    24    3  1  0
 5    30    1  1  1
 6    36    2  2  0
 7    48    4  1  0
 8    54    1  3  0
 9    60    2  1  1
10    72    3  2  0
11    90    1  2  1
12    96    5  1  0
13   108    2  3  0
14   120    3  1  1
15   144    4  2  0
16   150    1  1  2
17   162    1  4  0
18   180    2  2  1
19   192    6  1  0
20   216    3  3  0
21   240    4  1  1
22   270    1  3  1
23   288    5  2  0
24   300    2  1  2
25   324    2  4  0
26   360    3  2  1
27   384    7  1  0
28   432    4  3  0
29   450    1  2  2
30   480    5  1  1
31   486    1  5  0
32   540    2  3  1
33   576    6  2  0
34   600    3  1  2
35   648    3  4  0
36   720    4  2  1
37   768    8  1  0
38   810    1  4  1
39   864    5  3  0
40   900    2  2  2
41   960    6  1  1
42   972    2  5  0
43  1080    3  3  1
44  1152    7  2  0
45  1200    4  1  2
46  1296    4  4  0
2.3 Scalar operation count estimation

The dimensioning of the Primecore processor is based on operation count statistics. An initial rough
operation count estimate (+, * and ld/st) is made by instrumenting the single-stage scalar code. To give an
idea of the required throughput, we report the operation count statistics for the DFT size 720 in Table 2.
The numbers do not include address computation, call overhead or loop overhead. Also, note that the
operation count increases further if multiple stages are required (the in-between twiddles require loads and
multiplications). The "average scalar operations/cycle" line shows how many parallel operations we have
to provide in the architecture: over 40 parallel operations must be performed per cycle.
From these statistics it is clear that we have to maximize the use of Single Instruction Multiple Data (SIMD)
to reach the desired throughput. The greatest common factor over all DFT sizes is 6. Moreover, most data
is loaded/stored as complex data. Hence, our initial refinement is a vector machine of 6 complex elements
(bringing the number of parallel scalar elements to 12). In the ideal case, an Instruction Level Parallelism
(ILP) of 4 is then needed to achieve the required parallelism of 40 operations per cycle.
2.4 Vectorization
As mentioned before, a DFT can be built up in multiple stages (like an FFT). The common factor 6 in all
DFT sizes allows a 6-element part to be split off. In this way, the whole DFT is decomposed into 6 sub-DFTs
that internally have the same data flow (see the left-hand side of Figure 1). Each sub-word will be in a
different sub-DFT (the red arrows in front of the sub-DFTs form one vector). The second (split-off) part
processes the data across the sub-words. A special data path operation is added to perform the final radix-6
butterfly. The vectorized code becomes very regular and no data repacking is required in the DFT. The
right-hand side of Figure 1 shows the data flow where we use vectorized butterfly operations. All inter-sub-
word dependencies are implemented in the irdx6 data path operation.
The same vectorization technique can be applied to an FFT. However, the common factor is not 6; either a
common factor of 4 or 8 must be used. It must also be taken into account that the largest FFT is 2K.
Therefore the number of sub-words has been expanded to eight. Similar to the DFT, an irdx8 primitive has
been added to vectorize the code.
Instead of building separate irdx6 and irdx8 data paths, we can build a single irdx data path operation that
executes either one, depending on a mode bit. This enables code reuse. The example of Figure 1 executes a
24-point DFT; by switching the mode to irdx8, the same code performs a 32-point FFT (of course, different
twiddle factors are also required).
2.5 Special operations

Instead of processing separate additions and subtractions, it is much more efficient to specialize the
operations. Butterfly operations and special multiply operations will be introduced. These special-purpose
arithmetic operations are called primitive functions in IP Designer.
Most additions/subtractions happen in a butterfly: the same two values are added and subtracted. A
combined primitive is faster, as it performs the two operations in the same cycle. Moreover, it avoids
reading the same values twice from the register file, which reduces the number of register ports.

[Figure 1: Vectorized DFT data flow. Left: the DFT decomposed into six sub-DFTs (sub-dft0 .. sub-dft5)
in two parts (part1, part2). Right: the vectorized data flow built from v6_cbf butterflies and twiddle
multiplications, ending in the irdx6 stages.]

Two types of butterflies can be identified: vcbf0 and vcbf1; the difference is that the second operand of
vcbf1 is rotated 90 degrees. The basic functionality is described as follows (in Section 2.8 we will add
scaling to the primitive):
// complex butterfly
inline void vcbf0(vcmplx_t i0, vcmplx_t i1, vcmplx_t& o0, vcmplx_t& o1)
{
  o0 = vcmplx_t(i0.vreal()+i1.vreal(), i0.vimag()+i1.vimag());
  o1 = vcmplx_t(i0.vreal()-i1.vreal(), i0.vimag()-i1.vimag());
}
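In scalar form (with std::complex standing in for the vector type, purely as an illustrative sketch), such a butterfly is simply a 2-point DFT computing the sum and difference at once:

```cpp
#include <complex>

using cplx = std::complex<double>;

// Scalar model of the vcbf0 primitive: one butterfly is a 2-point DFT,
// producing the sum and the difference of its inputs in a single step.
void cbf0(cplx i0, cplx i1, cplx& o0, cplx& o1) {
    o0 = i0 + i1;  // DFT bin 0
    o1 = i0 - i1;  // DFT bin 1
}
```

For example, cbf0 applied to 1+2i and 3+4i yields 4+6i and -2-2i.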
2.6 Reducing load/stores

The number of parallel load/stores dictates the number of memory ports. Preferably, the number of memory
ports is low, to ease IP integration.

A straightforward implementation reads and writes the data at every module boundary and twiddle boundary.
Consider DFT size 900. It requires three stages: 30, 5 and 6 (the final stage is always 6, and there can be
only one factor 5 per stage). The first stage is decomposed into the modules 2, 3 and 5. As a result, we load
and store the data 14 times (see the 7 loads + 7 stores in Figure 2a). Taking into account that every
load/store is a vector of 6 elements and that we target a throughput of 1 cycle/element, we can derive that
we may load/store the data at most 6 times in total (assuming zero overhead).
The number of load/stores can be reduced by merging functions. An obvious improvement is to merge the
twid_1to2 computations with the irdx stage. Similarly, the twid_0to1 computations can be merged into the
module before or after them. Note that this increases the code size, as all modules must be implemented
both with and without twiddle computation (adding a condition in the inner loop is not an option, as this
would result in a loss of performance). Similarly, we can consider merging the twid_1to2+irdx and a
module (see Figure 2b).
[Figure 2: Load/store behaviour of a 900-point DFT. (a) Separate passes module2, module3, module5,
twid_0to1, module5, twid_1to2 and irdx6, each delimited by a ld/st pair. (b) The twiddles merged into the
neighbouring module/irdx6 passes. (c) Smart merging of modules: module2+module5 and
module3+module5.]
Figure 2: a) Every element is loaded and stored 14 times in a straightforward implementation of a DFT of
size 900. b) Merging the twiddles and irdx reduces the load/stores to 8 times. c) Smart merging of modules
reduces the load/stores to 4 times.
A further reduction in load/stores can be achieved by merging modules. Here, for instance, module2 can
be merged into stage0, while module3 is moved to (and merged into) stage1. The load/stores are avoided
by keeping the data in a local register file. Care should be taken that the modules do not get too big;
preferably we keep them under 16 so that we can limit the register file size. Moreover, smart merging avoids
having to build all combinations of sizes with and without irdx/twiddle. A last degree of freedom that can
be exploited to optimize the combinations is the size of the final stage (irdx6 or irdx8).
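The counts quoted for Figure 2 follow from the number of fused passes over the data, since each pass reads and writes every element once. A toy C++ model (the groupings paraphrase the figure; the helper is purely illustrative):

```cpp
#include <string>
#include <vector>

using Group = std::vector<std::string>;  // modules fused into one pass

// Every fused group loads each element once and stores it once, so the
// number of memory accesses per element is twice the number of groups.
int accesses_per_element(const std::vector<Group>& groups) {
    return 2 * (int)groups.size();
}

// DFT size 900 (stages 30, 5 and 6), as in Figure 2:
const std::vector<Group> naive = {
    {"module2"}, {"module3"}, {"module5"}, {"twid_0to1"},
    {"module5"}, {"twid_1to2"}, {"irdx6"}};
const std::vector<Group> twiddles_merged = {
    {"module2"}, {"module3"}, {"module5", "twid_0to1"},
    {"module5", "twid_1to2", "irdx6"}};
const std::vector<Group> smart_merged = {
    {"module2", "module5", "twid_0to1"},
    {"module3", "module5", "twid_1to2", "irdx6"}};
```

The three partitions give 14, 8 and 4 accesses per element, matching panels a), b) and c) of the figure.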
The left-hand side of Table 3 shows the result of the module merging optimization. Note that stage2 is the
irdx primitive and it is merged together with stage1. As a result, all data is read/written at most 4 times for
any DFT size. Note, however, that the module code differs depending on whether it is used in stage0 or in
stage1, though further code replication is avoided by the irdx6/irdx8 mode bit. It can also be observed that
the code size is limited by having only the modules 5, 8, 9 and 10 in stage0. This table has also been coded
in the application source code, as a lookup table to select the correct functions.

The FFT can be implemented with the same structure. A module16 has been added to stage0 to implement
all needed FFT sizes.
2.7 Vector operation count estimation

Refined operation count statistics were measured after the source code optimizations of the previous sub-
sections. Based on these new statistics, it is feasible to build an architecture that meets the real-time
constraint. The statistics are given in the middle part of Table 3.
We have measured the average number of load/stores, multiply and butterfly operations. The irdx operations
and the loading of coefficients are not measured, as they are always lower in number than the stores (and
hence a less critical resource). Based on the average number of operations per cycle, we could conclude
that one unit per type would be sufficient. However, we should beware of the following pitfalls:
• Loop overhead. The modules contain a loop that iterates over the data. Software pipelining (or loop
folding) will be applied by the compiler to exploit the ILP. However, the additional cycles to set up
the loop can be significant for low iteration counts. In our case we have low iteration counts, ranging
from 1 to 18. The loop overhead can be reduced by allowing delay slot instructions after the loop
setup instruction.
Table 3: Left: DFT/FFT stage sizes. Middle: Statistics for ld/st, mult and butterflies in optimized code.
Right: cycle measurements on primecore.
• Module switching overhead. A switch statement can have a significant overhead (easily tens of
cycles), which matters for the smaller DFTs. Simply inlining all modules is not a solution, as the
code size would explode. Therefore we propose to use function pointers, which allow fast switching
to any module function. For the smaller DFT sizes we can still specialize the code to reduce the
overhead further.
• Loading module coefficients. Some of the modules require coefficients, which need to be generated.
Generating constants from immediate values in the instruction word is undesirable, as it increases the
length of the instruction word. The better alternative is to load coefficients from a memory. In that
case the overhead of loading these coefficients should be reduced as much as possible. This can be
done by packing multiple coefficients in a vector.
• Limited scheduling freedom. Even with sufficient functional units, it can very well be that the
operations cannot be scheduled efficiently due to data dependencies. Imbalance between the loops can
also cause a suboptimal schedule (for instance, the first module has many load/stores, while the
second module needs many butterflies). The scheduling freedom can be improved by introducing
indexed load/stores (for data and coefficients). All module accesses follow the pattern p[# * n]
(where # is an immediate and n a variable). The compiler can then reorder the indexed accesses to
match the schedule. An architecture matching this pattern is best placed to exploit the scheduling
freedom.
The initial processor architecture we have in mind has 5 parallel slots:
1. One vload/vstore unit for data.
2. One vload unit for coefficients.
3. One multiplier unit for vector-mul, complex-vector-mul and vsummul.
4. One butterfly unit.
5. One irdx unit.
More accurate results can only be obtained by working out the architecture and compiling the code.
2.8 Fixed point and data scaling

The vectorization and merging optimizations have been performed on double-precision code. Checking the
correctness of the source code in double precision is easier, as the error in the output signal should be very
small. Quantization errors are introduced when making the code fixed point. Some effort has been spent on
reducing the quantization error by adding scale and round functions to the butterfly and irdx primitives. The
scaling results in a higher gate count but yields a better Signal-to-Noise Ratio (SNR). The current primitive
allows independent scaling of the inputs, while the outputs are scaled together. To be concrete, the scaling
version of vcbf0 is as follows:
base_t sh2mul[] = { 1.0, 0.5, 0.25, 0.125 };

void vcbf0(vcmplx_t i0, vcmplx_t i1, vcmplx_t& o0, vcmplx_t& o1,
           /* scaling factors: */ int si0, int si1, int so)
{
  vcmplx_t t0 = i0*sh2mul[si0];
  vcmplx_t t2 = i1*sh2mul[si1];
  o0 = (t0+i1)*sh2mul[so];
  o1 = (t0-t2)*sh2mul[so];
}
Of course, the scaling is implemented as a shift in hardware. In fact, we have tried multiple rounding
schemes. Especially the output scaling is sensitive to rounding error accumulation [1]. Therefore we use a
round-to-even function for the output scaling and a round-down function for the input scaling.
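The two rounding modes can be modeled in plain C++ (a stand-alone illustrative sketch, not the PDG primitives):

```cpp
// Round-down scaling: plain arithmetic right shift (floor division by 2^s).
int shift_round_down(int v, int s) { return v >> s; }

// Round-half-to-even scaling: round up when the remainder exceeds half an
// LSB; on an exact tie, round toward the even result so that ties do not
// bias the accumulated error.
int shift_round_even(int v, int s) {
    if (s == 0) return v;
    int q = v >> s;                  // floor quotient
    int r = v & ((1 << s) - 1);      // remainder in [0, 2^s)
    int half = 1 << (s - 1);
    if (r > half || (r == half && (q & 1))) ++q;
    return q;
}
```

For example, 5/2 rounds to 2 while 7/2 rounds to 4 under round-to-even, so ties land on even values instead of drifting systematically upward.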
An appropriate scaling/rounding scheme can save quite a few hardware bits and hence make the architecture
cheaper. An ASIP allows both the rounding instructions and the bit widths to be designed flexibly to arrive
at the optimal solution.
While developing the code it is useful to keep the double-precision version as a reference. The appli-
cation code has been set up in such a way that native compilation can be done both in double precision and
in fractional mode.

2.9 Source code structure of modules
y3 = vsummul(t3,t4,cm1);
y4 = vsummul(t3,t4,cm2);
vcbf1(y1,y4, o4,o1, /* scaling: */ 0,0);
vcbf1(y2,y3, o3,o2, /* scaling: */ 0,0);
}
i0 = pIn[0*stepn];
i1 = pIn[1*stepn];
i2 = pIn[2*stepn];
i3 = pIn[3*stepn];
i4 = pIn[4*stepn];
pIn++;
vmodule5io(i0,i1,i2,i3,i4, t0,t1,t2,t3,t4,
select_tc(c,0),cselect_tc(c,2),cselect_tc(c,0));
t0 = t0 * pTwid[0*stepn]; o0 = irdx(t0,mrRS);
t1 = t1 * pTwid[1*stepn]; o1 = irdx(t1,mrRS);
t2 = t2 * pTwid[2*stepn]; o2 = irdx(t2,mrRS);
t3 = t3 * pTwid[3*stepn]; o3 = irdx(t3,mrRS);
t4 = t4 * pTwid[4*stepn]; o4 = irdx(t4,mrRS);
pTwid++;
2.10 Guiding compiler

By guiding the compiler with pragmas, the cycle budget can be reduced. The pragmas were tuned after
building the architecture. We have done the following:
• Stating that a loop has at least 1 iteration, with the pragma chess_loop_range(1,). This avoids the
compiler inserting a condition to check for at least 1 iteration.
• Stating where a function parameter is stored, with the pragma int chess_storage(m0) n. This avoids a
move from an argument register to the M register.
• The compiler was suboptimal in assigning the input data to registers. This is more difficult because
the set of registers that can be loaded with data is limited. Therefore we have assigned them manually
for the problematic modules, as follows: vcmplx_t CHESS_STORAGE_MOD9(V0) i0;. We are investigating
how to improve the compiler on this issue.
• Restrict pointers to give scheduling freedom to the compiler (also for the individual writes).
3 Architecture
3.1 Overall structure

The Primecore architecture has been implemented to match the data flow of the modules (see Section 2.9).
A diagram of the vector-related part is given in Figure 3. There are two memories: DM for the data and
CM for the coefficients. The main register file is V[24]; it can store all the temporary data for a module
without having to spill that data to memory. An effort is made to limit the number of ports on this large
register file and thus to keep the gate count low. The two read ports to the vec0 unit can only read from
half the register file. Likewise, the loads from memory and the results from the multiplier can only write
to a limited range (shown as the green and purple parts). The X[6] register file contains the data that goes
to the multiplier. The (twiddle) constants for the multiplier are stored in T. The TC register is meant for
the longer-term constants, while TW is meant for the short-lifetime twiddle factors. The vsel0 unit selects
the required constant. The multiplied results can be written back either to the V register file or to the RDXI
register, to be processed in the final stage. The vec1 unit performs the final irdx function and its output is
written back to memory via the RDXO register. If the module is in the first stage, the result can be stored
directly from the RDXI register.
[Figure 3: Primecore vector data path. Memories DM and CM; register files V[24], VA[12], VB[12], X[6],
T[2] (TC and TW), RDXI and RDXO; functional units vsel0, vec0 (butterfly), vec1 (irdx) and vmpy0
(vcmul, vmul, vsummul).]
Remarks:
• The processor has 5 parallel ILP issue slots. The three functional units operate in parallel. Moreover,
the two memories can load/store data in parallel.
• To meet the required frequency of 500 MHz, we have used a pipelined multiplier.
3.2 Primitives
Coding the primitives is done in a C-like language with the PDG tool. For instance, the butterfly primitive
can be coded as follows.
void vcbfly0(vcword vi0, vcword vi1, vcword& vo0, vcword& vo1,
             uint2 f0, uint2 f1, uint2 f2)
{
  vcword r0, r1;
  vcword t0 = vcscale(vi0, f0);
  vcword t1 = vcscale(vi1, f1);
  for (int32_t i = 0; i < VSIZE; i++) {
    r0[i] = cmplx(ext_re(t0[i])+ext_re(vi1[i]), ext_im(t0[i])+ext_im(vi1[i]));
    r1[i] = cmplx(ext_re(t0[i])-ext_re(t1[i]), ext_im(t0[i])-ext_im(t1[i]));
  }
  vo0 = vcscale(r0, f2);
  vo1 = vcscale(r1, f2);
}
The only primitive that requires a more detailed explanation is the irdx primitive. Special care is taken to
avoid duplication of resources and a long critical path. The irdx6 and irdx8 variants are written down in one
function to enforce resource sharing. Also, the fixed multiplications in these primitives are written as sums
of shifts to avoid the instantiation of a multiplier. Even the summing has been written so as to share the
adders (in mul0_irdx). The whole primitive is computed with 18-bit precision, and the result is rounded at
the end. In this way the irdx primitive has become a small functional unit that meets the timing constraint
of a 500 MHz clock.
vcword irdx(vcword a,uint1 rdx8)
{
cwordP2 dummy_P2 = (cwordP2)0;
cwordP2 i0 = cnvt_cwordP2(a[0]);
cwordP2 i1 = cnvt_cwordP2(a[1]);
cwordP2 i2 = cnvt_cwordP2(a[2]);
cwordP2 i3 = cnvt_cwordP2(a[3]);
cwordP2 i4 = cnvt_cwordP2(a[4]);
cwordP2 i5 = cnvt_cwordP2(a[5]);
cwordP2 i6 = cnvt_cwordP2(a[6]);
cwordP2 i7 = cnvt_cwordP2(a[7]);
uint1_t rdx6=1-rdx8;
cwordP2 t1A,t2A,t3,t4,t5A,t6A,t7A,t8A,t9A,t10A,t11A,t12A;
cwordP2 y1A,y2A,y3A,y4A,y5A,y6A,y7A;
cwordP2 o0A,o1A,o2A,o3A,o4A,o5A,o6A,o7A;
cbf0_P22( i1, i5, t3, t4, /* scaling: */ 0);
cbf0_P22( i2, rdx8?i6:i4, t5A, t6A, /* scaling: */ 0);
cbf0_P22( i0, rdx8?i4:t5A, t1A, t2A, /* scaling: */ rdx6);
cbf0_P22( i3, rdx8?i7:t3, t7A, t8A, /* scaling: */ rdx6);
cbf0_P22( t1A,rdx8?t5A:t7A, t9A, y2A, /* scaling: */ 0);
cwordP2 m1,m2;
m1 = mul0_irdx(rdx8,t11A,t6A);
m2 = mul1_irdx(rdx8,t12A,t4);
t6A = cmplxP2(ext_im_P2(t6A),-ext_re_P2(t6A));
cbf0_P22( t2A,m1, y1A,y3A,/* scaling: */ 0);
cbf0_P22(rdx8?m2:t8A, rdx8?t6A:m2, y7A,y5A,/* scaling: */ 0);
vcword r;
r[0] = cscale_P2(rdx8?o0A:t9A,2);
r[1] = cscale_P2(rdx8?o1A:o3A,2);
r[2] = cscale_P2(rdx8?o2A:o7A,2);
r[3] = cscale_P2(rdx8?o3A:y2A,2);
r[4] = cscale_P2(rdx8?o4A:o5A,2);
r[5] = cscale_P2(rdx8?o5A:o1A,2);
r[6] = cscale_P2(rdx8?o6A:dummy_P2,2);
r[7] = cscale_P2(rdx8?o7A:dummy_P2,2);
return r;
}
wordP2 mul0i_irdx(uint1_t rdx8, wordP2 a, wordP2 b)
{
wordP2 ina = a;
wordP2 inb = b;
wordP2 inbr = ~inb;
wordP2 part0 = rdx8?(ina>>1) :(inbr);
wordP2 part1 = rdx8?(ina>>3) :(inb>>3);
wordP2 part2 = rdx8?(ina>>4) :(inb>>7);
wordP2 part3 = rdx8?(ina>>6) :(inb>>10);
wordP2 part4 = rdx8?(ina>>8) :(inb>>12);
wordP2 part5 = rdx8?(ina>>14):1;
wordP2 r1 = (part0+part1+part2+part3+part4+part5);
return r1;
}
// similar to mul0i_irdx are:
wordP2 mul0r_irdx(uint1_t rdx8, wordP2 a, wordP2 b)
wordP2 mul1i_irdx(uint1_t rdx8, wordP2 a, wordP2 b)
wordP2 mul1r_irdx(uint1_t rdx8, wordP2 a, wordP2 b)
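The shift patterns above appear to encode the fixed radix constants as sums of shifts: the rdx8 path sums to roughly 1/√2 (the W8 twiddle) and the rdx6 path (negated input plus shifts) to roughly √3/2 (the sin 60° factor of the radix-6 butterfly). This reading is our own inference from the code; a quick numeric check in C++:

```cpp
#include <cmath>

// Weight of the rdx8 shift pattern in mul0i_irdx:
// a>>1 + a>>3 + a>>4 + a>>6 + a>>8 + a>>14.
double shift_sum_rdx8() {
    return std::pow(2, -1) + std::pow(2, -3) + std::pow(2, -4)
         + std::pow(2, -6) + std::pow(2, -8) + std::pow(2, -14);
}

// Magnitude of the rdx6 path: ~b + b>>3 + b>>7 + b>>10 + b>>12 + 1
// equals -b * (1 - 2^-3 - 2^-7 - 2^-10 - 2^-12).
double shift_sum_rdx6() {
    return 1.0 - std::pow(2, -3) - std::pow(2, -7)
               - std::pow(2, -10) - std::pow(2, -12);
}
```

Both sums land within about 1e-4 of the exact constants, which is consistent with the 18-bit internal precision of the primitive.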
4 Results

This section first discusses the cycle counts, followed by the synthesis results, the code size and the
signal-to-noise ratio.
4.1 Cycle count

The cycle count requirements are met by designing an architecture with sufficient parallelism. The compiler
then has to find a schedule that uses all resources efficiently. We determined a lower bound on the cycle
count for the Primecore architecture and will show that the compiler comes very close to this bound.

The cycle counts for the DFT/FFT sizes are given in the right-hand side of Table 3. The cycle budget is met
for all sizes; for some of them we even have up to 50% slack remaining. This is because the architecture
is designed for the worst case, and some modules are simpler to implement than others. The main source of
slack is that the SIMD-8 hardware can be used for some DFT sizes.
The cycle count is largely determined by the scheduling of loops. The loop folding transformation
schedules multiple iterations of a loop together to exploit the available ILP. The Initiation Interval (II) is
the most important figure of merit: it is the number of cycles required for one iteration. The folding
requires the creation of a pre/postamble around the loop, and the cycles spent in the pre/postamble can be
significant when the iteration count is low.
Figure 4 shows the scheduled result of vmod10. The red arrow at the left shows the part that is repeated
by the loop (from 429 to 410). The II of the loop is 20 cycles. This is the minimum for this loop on
this architecture, as 20 loads and stores must happen (see the red box). The loop is issued by instruction
407. Instructions 408 and 409 are so-called delay slots. These delay slots can issue useful preamble
instructions, but are not yet part of the loop. Delay slots are required because it takes a few cycles to
initialize the registers that control the zero-overhead loop.
Most modules have a minimum cycle count dictated by the load/store unit: every element must be loaded
once and stored once, so module10 requires 20 load/stores. Only the large modules are limited by the
number of butterflies; module16 and module18 require 34 and 46 butterfly operations respectively.
Table 4 shows how close we are to the minimum II for most modules. By inspecting the schedules, one
can also observe that only a minimum of cycles is lost in the preamble.

The most important conclusion we can draw is that the available hardware resources are used efficiently.
Figure 4: The scheduled vmod10 reaches the minimum II of 20. The critical resource are the 20 data
memory load/stores.
Table 4: Minimum and actual II of loops of the modules. Nearly always the minimum is reached.
[Figure 5: Gate count distribution: vmpy 36%, reg_V 29%, vec0 (bf) 10%, reg_X 8%, vec1 (irdx) 5%,
other vec_regs 3%, dec+ctrl 2%.]
4.2 Synthesis results

An RTL implementation has been generated of the architecture presented in Section 3. Our default synthe-
sis script has been applied to synthesize the core. Tools and libraries used:
Synopsys: synC-2009.06-SP4
Lib: tsmc065gp
Since we do not have a full-option synthesis tool, the results are suboptimal. In particular, we need a
pipelined multiplier to meet the target frequency of 500 MHz, but our DesignWare library does not contain
one. In our synthesis experiments we have inserted our own pipelined multiplier, as our synthesis tools do
not have retiming enabled. However, it is best to use the retiming option of Synopsys to obtain an optimal
pipelined multiplier. The numbers in this section should therefore be interpreted as rough indications,
which can be improved with better synthesis tools and libraries.
The architecture is sliced to ease synthesis. Slicing allows the synthesis tool to process the architecture
per SIMD element rather than synthesizing the whole at once. The main advantage is a faster synthesis
run.
The design synthesizes into 320 kGates at 500 MHz (other frequencies can be found in Table ??). The gate
count distribution is shown in Figure 5. The two dominant contributors are the pipelined vector (complex)
multiplier and the register file V. The vector multiplier contains 4×8 scalar multipliers of 16×16→32 bits.
These will reduce in size when using a better synthesis tool and a better library. The register file V cannot
be reduced in size without affecting the cycle budget, as it must be able to hold the data for a complete
module.
4.3 Code size

The program memory size is (obviously) highly dependent on the whole program. Counting the testbench
program that loads/stores data via hosted IO is not interesting. Table 6 reports 3 components: modules,
control and initialization code. The code size is the number of words of the compiled code. The modules
form the bulk of the code. A lot of care was taken to minimize code duplication while maximizing
performance.

Table 7: SNR for the DFT/FFT algorithm (min/average/max over all sizes). Real and imaginary parts
measured separately.
Currently, the instruction encoding is generated automatically. The instruction word is 91 bits. A rough
estimate is that it can be reduced to 64 bits by manually optimizing the instruction encoding. The non-
parallel instructions can even be reduced further, to 32 bits or less. A combined long/short instruction set
can achieve such a code size reduction; the IP Designer tool suite supports such combined instruction sets.

A projection of the overall code size is given in the last column of Table 6. It can be expected that the
control and initialization code will grow when the application is embedded in its context. Here too, the
combined instruction set limits the overhead.
The signal to noise ratio (SNR) is an important quality measure. The bit width of the processor contributes
to the quality of the signal, but the chosen algorithm and rounding are at least as important. Finally, the
signal itself matters: a small signal will never obtain a good signal to noise ratio. We have done some
limited measurements to get a first impression; a more thorough measurement campaign is still required.
The generated test input signal is a sum of sine signals (see the function generate input; an example
signal is shown in Figure 6). This results in a number of peaks in the frequency domain. We use a
double-precision DFT as the reference signal (labeled “ddft”). The signal degradation differs for the
various DFT sizes; therefore we report a minimum, an average and a maximum. We compute the SNR
for the following cases:
• Quantized DFT signal (labeled “qdft”). Some error is already introduced by quantizing the output signal.
• Fractional straightforward DFT (labeled “dft”). This DFT is computed with 16-bit internal precision.
The summation is a binary tree that scales the sum by 1 bit at every level. This is the fairest comparison
algorithm: scaling before summation results in a poor SNR, while summing in 32-bit precision and
scaling afterwards uses a much higher precision.
• PFA-DFT (labeled “pfa”).
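The binary-tree summation used for the “dft” case can be sketched as follows (a simplified illustration, not the actual application code; `n` is assumed to be a power of two and the shift is assumed to be arithmetic):

```c
#include <stdint.h>

/* Sum n (a power of two) 16-bit values pairwise, in place. Each tree
 * level halves the number of terms and scales the pair sums by one
 * bit, so intermediates never leave the 16-bit range. The result is
 * therefore the true sum divided by n. */
static int16_t tree_sum16(int16_t *v, int n)
{
    for (int len = n; len > 1; len /= 2)
        for (int i = 0; i < len / 2; i++)
            v[i] = (int16_t)(((int32_t)v[2 * i] + v[2 * i + 1]) >> 1);
    return v[0];
}
```

Scaling one bit per level loses at most one bit of precision per tree level, which is why this variant sits between scale-before-summation (poor SNR) and full 32-bit accumulation (much higher precision).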
Table 7 reports the real and imaginary SNR separately for the cases mentioned above. The quantized DFT
has an average SNR of about 80 dB. About 6 dB is lost when the internal precision is reduced to 16 bits.
Our PFA-based computation loses only 3 dB extra. More detailed information is reported when running
the native code (see Section 5.1).
5 Evaluation
This section gives some guidance on how to evaluate the Primecore. Through a tutorial-like description,
we will guide you through the various tasks: compiling and running the code natively, compiling and
running the code on the target processor, building an Instruction Set Simulator (ISS), and generating and
synthesizing Verilog. The processor description can be found in the primecore directory, while the
application code can be found in the vpfa directory. The scripts have been tested under SUSE Enterprise
Linux 11 with gcc 4.3.2. It is assumed that the IP Designer tool suite is installed and a license is
available [7].
During development we have used different data types. The initial code was developed in double
precision. Then a fixed-point fractional type was used, which was subsequently refined to the 16-bit
bit-true type of the Primecore. Since the current application code is probably not the end point, we have
kept the compilation paths for the double-precision and fractional types.
The native testbench generates an input signal by summing a number of sine and cosine signals of
different frequency and amplitude (real and imaginary). This should result in peaks at the corresponding
frequency locations.
The application code can be compiled in double precision, executed and tested with the test_double.sh
script. It executes a straightforward DFT and a PFA-DFT, compares the results and sums the total error.
The final output should report that there are no differences:
TOTAL error_pfa_dft =(0.000000,0.000000)
TOTAL error_pfa_ddft=(0.000000,0.000000)
TOTAL error_dft_ddft=(0.000000,0.000000)
Similarly, the fractional type can be used with the test_fract.sh script. This application computes the DFT
in double precision as a reference (named “ddft”). It also computes a straightforward DFT using the
fractional type (named “dft”). The quantization during the fractional multiplications shows the minimal
signal distortion caused by quantizing. Finally, the PFA-DFT is executed. The error among these 3
functions is determined and printed. As can be expected, the error increases when rounding and scaling
are performed. The signals and errors can be dumped to files for one DFT size of your choice for further
inspection (by setting #define DUMP_DFT_SIZE 162). The dumped files can be visualized with jgraph [5]
or another application. Figure 6 gives an idea of the jgraph output. The jgraph command is as follows:
jgraph test.jgr > out.eps ; ghostview out.eps
The fractional type has 32-bit internal precision, so wrapping or clipping should not cause problems
there. However, they will be a problem when using the 16-bit type of the processor. The rounding in the
application source should prevent such overflows. The fractional type tracks the minimum and maximum
values used. Any overflow problems should be solved before switching to the bit-true type.
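The min/max tracking mentioned above can be realized with a small wrapper around each produced value; a simplified sketch (the actual fractional type in the code base is richer; all names here are illustrative):

```c
#include <float.h>

/* Track the dynamic range of fractional values as they are produced.
 * If min/max stay inside [-1, 1), the 16-bit bit-true type will not
 * overflow. */
typedef struct { double min, max; } frac_range_t;

static void range_init(frac_range_t *r)
{
    r->min = DBL_MAX;
    r->max = -DBL_MAX;
}

/* Record v and pass it through unchanged. */
static double range_track(frac_range_t *r, double v)
{
    if (v < r->min) r->min = v;
    if (v > r->max) r->max = v;
    return v;
}

/* Nonzero if any tracked value left the Q15 fractional range. */
static int range_overflows(const frac_range_t *r)
{
    return r->min < -1.0 || r->max >= 1.0;
}
```

Running the fractional build with such tracking pinpoints which intermediate values need extra scaling before the 16-bit bit-true type can be used safely.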
The final native simulation uses the primitives of the processor and will therefore compute the bit-true
output of the application running on the Primecore (execute test_bittrue.sh). The output should be
equal to that of the previous simulation if no overflow has occurred. Note that you have to build the ISS
primitives first, before you can run this script (so run the update_proc.sh script first).
Generating an input signal is difficult and time consuming on the Primecore (sin/cos computation is not
trivial). Therefore we use the native testbench to generate fixed-point input data and reference data for all
sizes. Hosted IO is used on the ISS to load the input, coefficients and reference data and save the output
data.
[Figure 6: example test input signal, the double-precision DFT reference (“ddft”), the quantized DFT
(“qdft”), and the quantization noise (difference between “ddft” and “qdft”); real and imaginary parts
shown separately.]
An ISS must be built before starting a simulation. The following script updates the ISS and all runtime
libraries for the current processor model.
cd primecore/lib; ./update_proc.sh
The measure.sh script can be used to compile the application code and run it on the ISS. The script also
verifies that the produced output data corresponds to the natively computed output. Finally, it reports the
timing measurements.
As noted above, generating an input signal is difficult and time consuming on the Primecore. Moreover,
hosted IO is not supported for RTL simulation. Therefore a separate testbench has been made that
compiles the input data and reference data into the executable. The application code is first run on the
ISS, logging all register changes. In a second step, the same code is run on the simulated RTL, which
also creates a Register Change Dump (RCD). The two RCDs should be equal. The following script
executes the whole process:
cd primecore/hdl ; ./test_hdl.sh
6 Future work
The Primecore has the basic functionality to perform the DFT/FFT algorithms. However, some further
tuning of the core is required. In particular, the context in the system has to be refined. Possible next
steps are:
• Synchronization with the rest of the system (FIFO communication?).
• Coefficient generation, which is currently done offline.
• The rounding, which requires further investigation.
• Optimising the instruction encoding to reduce the instruction word (see Section 4.3).
7 Summary
The Primecore shows the potential of the IP Designer tool suite in the context of LTE base stations.
Processors can be used to implement the FFT and DFT algorithms efficiently. Convincing evidence is
given that a tuned architecture can be built that meets the tight cycle budget at high clock frequency
constraints. A 6/8-way SIMD machine has been built. Special instructions have been added to efficiently
process butterflies with scaling, as well as dedicated instructions for performing a radix-6 and radix-8
among the SIMD elements. For all symbol sizes we achieve a performance better than 1 cycle per
complex input sample; for some sizes there is even considerable slack. This slack can be used to achieve
an even higher throughput, or to power down the processor each time a frame has been processed. The
pipelined architecture can run at over 500 MHz. The total gate count is around 340 kgates, with the 8
complex multipliers as the main contributor. It can be expected that lower gate counts can be achieved
with better synthesis tools.
Glossary
II : Initiation Interval
The number of cycles between the starts of successive loop iterations.
ILP : Instruction Level Parallelism
ISS : Instruction Set Simulator
References