You are on page 1of 1

A 2048 complex point FFT processor for

acceleration in Digital Holographic Imaging


Thomas Lenart and Viktor Öwall

Digital Holographic Imaging FFT Implementation


In Digital Holographic Imaging, a high throughput FFT is required. 2D FFT transforms
of up to 2048x2048 complex points should be calculated in real-time. There are many
”A key component in a digital holographic imaging system is the holographic video
different ways to implement an FFT processor. The computations can be done in a number
camera. The purpose of the camera is to record the hologram on a solid-state sensor and
of iterations by time multiplexing a single memory and arithmetic unit (a), or by using a
digitally construct the image. The image construction is very computational expensive
pipelined architecture (b).
involving large 2-dimensional FFT calculations. A major task of designing the camera
system will thus be to design the ASIC, which will perform the image reconstruction.” a) do
0
b) FIFO
FIFO FIFO
FIFO FIFO
FIFO FIFO
FIFO

BF

BF

BF

BF
BF

BF

BF

BF
Memory
Memory
di
Hologram
addr
Control
Control &&twiddle
twiddlefactor
factor ROM
ROM Stage 1 Stage 2 Stage 3 Stage 4

Object Algorithm + Compact design + High throughput


+ + Flexible - Large design
- Low throughput
Reference

The achieve a high throughput, a pipelined architecture is selected. The implementation


ASIC is based on the Radix-22 decimation-in-frequency algorithm with a single path delay
Result
feedback.

Data scaling in pipelined architectures Data scaling - Convergent block floating point
The problem with a fixed point FFT is to a) b)
Existing approach:
20 # output bits from the butterfly
Large intermediate buffer
maintain the accuracy and preserve the # exponent bits
The basic idea is that the output from a radix-4 stage is a
dynamic range at the same time. One W(n)
set of 4 independent groups that can use different scale
15

way to achieve a high signal-to-noise factors. This will converge in the pipeline towards one

Trunc
Delay

Trunc
10 10
Delay
ratio is to increase the internal word exponent for each output sample from a pipelined FFT. 4log ( N ) −stage
4

length for every stage in the pipelined 5 5

The problem is the large memory requirements, caused


21 output bits

12 output bits
10 input bits

10 input bits

FFT (variable datapath), i.e. the word


0 0 by the intermediate buffers between the FFT stages, Controller
Controller
length will be wider at the output than at
required for storing one complete group before rescaling
Stage 1

Stage 2

Stage 3

Stage 4

Stage 5

Stage 6

Stage 1

Stage 2

Stage 3

Stage 4

Stage 5

Stage 6

the input (a).


can be performed. Another drawback is increased delay
Another way to improve the signal-to-noise ratio without increasing the internal word in the pipeline.
length is to use data scaling (b). One example of data scaling is block floating point that

decoder
Final decoder
uses exponents, or scaling factors, for internal representation to improve the SNR. The
CBFP

CBFP

CBFP

CBFP

CBFP
CBFP

CBFP

CBFP

CBFP

CBFP
Stage
Stage11 Stage
Stage22 Stage
Stage 33 Stage
Stage 44 Stage
Stage 55 Stage
Stage66
exponents are usually shared between the real and imaginary part of a complex value,

Final
(radix-2)
or even shared among a set of complex values, unlike normal floating point
representation. Delay
Delay Delay
Delay Delay
Delay Delay
Delay

Data scaling - Hybrid floating point Delay feedback implementation


Area comparison
1.8

Presented approach: 1.6


Flip-flops
Dual port memory
a) b) Single port memory
1.4 Double width memory
Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 Stage 6 Single
Singleport
portRAM
RAM 0 31:16 31:16
0
32x16
32x16 1 1.2
Single
Singleport
portRAM
Size in mm2

1 RAM
32x32
32x32
MUL

MUL

MUL

MUL

MUL

1
MBF

MBF

MBF

MBF

MBF
MBF

MBF

MBF

MBF

MBF
MUL

MUL

MUL

MUL

MUL
IBF
IBF

Single
Singleport
portRAM
RAM 15:0 15:0
32x16
32x16 ADDR nWE 0.8

0.6

Address
Address counter
counter Address
Addresscounter
counter 0.4
IBF MUL MBF
0.2

FIFO
FIFO FIFO
FIFO
FIFO
FIFO 0

Different approaches:
0 500 1000 1500 2000 2500 3000 3500 4000 4500
Normalizer

exponent exponent
Normalizer

exponent exponent Memory bits


Equalizer

Equalizer
Equalizer

Equalizer

Butterfly
Butterfly
(BF)
(BF) BF
BF BF
BF 1. Dual port memory connected to an address generator. This allows simultaneous reads
W(n) -j-j
and writes to any memory location. Requires larger area than single port memories.
Normal butterfly unit Complex multiplier
and normalizing unit
Modified butterfly using equalizers to align the
data. FIFO’s with scale factor representation
2. Two single port memories, alternating between reading and writing every clock cycle
(a). The drawback is duplication of the address logic when using two memories
The proposed solution is to rescale data on the fly and minimize the memory instead of one.
requirements. No changes are required of the complex adders/multipliers due to the 3. Only one single port memory but with double word length (b). This is possible, due
simple scaling logic using normalizers and equalizing units. The scaling logic uses only to the consecutive addressing scheme used in the delay feedback. In addition to
one scale factor for representing both the real and imaginary part of a complex value. removing the duplicated address logic, the total number of memories for placement
will be reduced.

Precision Memory and area requirements


The proposed FFT processor generates a constant Signal-to-Noise ratio for all kinds of For a 2048 complex point FFT processor, the memory occupies approximately 55% of
input signals due to the scaling logic in each pipeline stage. An FFT processor using the chip area. Reducing the memory requirement will therefore significantly affect the
variable datapath produces a higher SNR if the input signal utilizes the full dynamic total size of the design. Our approach is both memory efficient and accurate.
range, but for arbitrary input signals, scaling can be used to improve the Signal-to-Noise 160
Variable data path
11
Variable data path

ratio. One of the reasons that CBFP is not capable of keeping a constant SNR is that there 140 Convergent block floating point
Our approach
Our approach using RDL
10 Convergent block floating point
Our approach
120 9

is usually no CBFP logic in the first radix-2 stage due to the high memory requirements.
Kbits of memory

60
2
Size in mm

100 8
50
% of resources

80 7 40
1 60
60 6 30

40 5 20
50
20 4 10

points points 0
0 3 FIFO memory Complex MUL + ROM Butterfly Equalizer Normalizer
40 512 1024 2048 4096 512 1024 2048 4096

Memory requirements Total chip size Resource allocation


Amplitude

SNR (dB)

0 30

20
Conclusion
Constant data path
Variable data path
An FFT processor using a novel data scaling approach that can operate on a wide range
of input signals, keeping the SNR at a constant level, has been presented. Compared to
10 Convergent block floating point
Our approach without VDL
Our approach with VDL

-1 0
1 1/2 1/4 1/8 1/16 1/32 1/64
block scaling approaches, such as CBFP, the proposed design requires significantly
smaller chip area due to the reduced memory requirements. At the same time it is
input amplitude

Arbitrary input signals with different amplitudes SNR for different amplitudes
capable of producing a higher SNR since scaling is applied in all pipeline stages.

You might also like