Lenart

A 2048 complex point FFT processor for
acceleration in Digital Holographic Imaging

Thomas Lenart and Viktor Öwall
Digital Holographic Imaging FFT Implementation

In Digital Holographic Imaging, a high throughput FFT is required. 2D FFT transforms
of up to 2048x2048 complex points should be calculated in real-time. There are many
”A key component in a digital holographic imaging system is the holographic video
different ways to implement an FFT processor. The computations can be done in a number
camera. The purpose of the camera is to record the hologram on a solid-state sensor and
of iterations by time multiplexing a single memory and arithmetic unit (a), or by using a
digitally construct the image. The image construction is very computational expensive
pipelined architecture (b).
involving large 2-dimensional FFT calculations. A major task of designing the camera
system will thus be to design the ASIC, which will perform the image reconstruction.” a) do
0
b) FIFO
FIFO FIFO
FIFO FIFO
FIFO FIFO
FIFO
BF
BF
BF
BF
BF
BF
BF
BF
Memory
Memory
di
Hologram
addr
Control
Control &&twiddle
twiddlefactor
factor ROM
ROM Stage 1 Stage 2 Stage 3 Stage 4
Object Algorithm + Compact design + High throughput

+ + Flexible - Large design
- Low throughput
Reference
The achieve a high throughput, a pipelined architecture is selected. The implementation

ASIC is based on the Radix-22 decimation-in-frequency algorithm with a single path delay
Result
feedback.
Data scaling in pipelined architectures Data scaling - Convergent block floating point
The problem with a fixed point FFT is to a) b)
Existing approach:
20 # output bits from the butterfly
Large intermediate buffer
maintain the accuracy and preserve the # exponent bits
The basic idea is that the output from a radix-4 stage is a
dynamic range at the same time. One W(n)
set of 4 independent groups that can use different scale
15
way to achieve a high signal-to-noise factors. This will converge in the pipeline towards one
Trunc
Delay
Trunc
10 10
Delay
ratio is to increase the internal word exponent for each output sample from a pipelined FFT. 4log ( N ) −stage
4
length for every stage in the pipelined 5 5
The problem is the large memory requirements, caused

21 output bits
12 output bits
10 input bits
10 input bits
FFT (variable datapath), i.e. the word

0 0 by the intermediate buffers between the FFT stages, Controller
Controller
length will be wider at the output than at
required for storing one complete group before rescaling
Stage 1
Stage 2
Stage 3
Stage 4
Stage 5
Stage 6
Stage 1
Stage 2
Stage 3
Stage 4
Stage 5
Stage 6
the input (a).

can be performed. Another drawback is increased delay
Another way to improve the signal-to-noise ratio without increasing the internal word in the pipeline.
length is to use data scaling (b). One example of data scaling is block floating point that
decoder
Final decoder
uses exponents, or scaling factors, for internal representation to improve the SNR. The
CBFP
CBFP
CBFP
CBFP
CBFP
CBFP
CBFP
CBFP
CBFP
CBFP
Stage
Stage11 Stage
Stage22 Stage
Stage 33 Stage
Stage 44 Stage
Stage 55 Stage
Stage66
exponents are usually shared between the real and imaginary part of a complex value,
Final
(radix-2)
or even shared among a set of complex values, unlike normal floating point
representation. Delay
Delay Delay
Delay Delay
Delay Delay
Delay
Data scaling - Hybrid floating point Delay feedback implementation

Area comparison
1.8
Presented approach: 1.6

Flip-flops
Dual port memory
a) b) Single port memory
1.4 Double width memory
Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 Stage 6 Single
Singleport
portRAM
RAM 0 31:16 31:16
0
32x16
32x16 1 1.2
Single
Singleport
portRAM
Size in mm2
1 RAM
32x32
32x32
MUL
MUL
MUL
MUL
MUL
1
MBF
MBF
MBF
MBF
MBF
MBF
MBF
MBF
MBF
MBF
MUL
MUL
MUL
MUL
MUL
IBF
IBF
Single
Singleport
portRAM
RAM 15:0 15:0
32x16
32x16 ADDR nWE 0.8
0.6
Address
Address counter
counter Address
Addresscounter
counter 0.4
IBF MUL MBF
0.2
FIFO
FIFO FIFO
FIFO
FIFO
FIFO 0
Different approaches:
0 500 1000 1500 2000 2500 3000 3500 4000 4500
Normalizer
exponent exponent
Normalizer
exponent exponent Memory bits

Equalizer
Equalizer
Equalizer
Equalizer
Butterfly
Butterfly
(BF)
(BF) BF
BF BF
BF 1. Dual port memory connected to an address generator. This allows simultaneous reads
W(n) -j-j
and writes to any memory location. Requires larger area than single port memories.
Normal butterfly unit Complex multiplier
and normalizing unit
Modified butterfly using equalizers to align the
data. FIFO’s with scale factor representation
2. Two single port memories, alternating between reading and writing every clock cycle
(a). The drawback is duplication of the address logic when using two memories
The proposed solution is to rescale data on the fly and minimize the memory instead of one.
requirements. No changes are required of the complex adders/multipliers due to the 3. Only one single port memory but with double word length (b). This is possible, due
simple scaling logic using normalizers and equalizing units. The scaling logic uses only to the consecutive addressing scheme used in the delay feedback. In addition to
one scale factor for representing both the real and imaginary part of a complex value. removing the duplicated address logic, the total number of memories for placement
will be reduced.
Precision Memory and area requirements

The proposed FFT processor generates a constant Signal-to-Noise ratio for all kinds of For a 2048 complex point FFT processor, the memory occupies approximately 55% of
input signals due to the scaling logic in each pipeline stage. An FFT processor using the chip area. Reducing the memory requirement will therefore significantly affect the
variable datapath produces a higher SNR if the input signal utilizes the full dynamic total size of the design. Our approach is both memory efficient and accurate.
range, but for arbitrary input signals, scaling can be used to improve the Signal-to-Noise 160
Variable data path
11
Variable data path
ratio. One of the reasons that CBFP is not capable of keeping a constant SNR is that there 140 Convergent block floating point
Our approach
Our approach using RDL
10 Convergent block floating point
Our approach
120 9
is usually no CBFP logic in the first radix-2 stage due to the high memory requirements.
Kbits of memory
60
2
Size in mm
100 8
50
% of resources
80 7 40
1 60
60 6 30
40 5 20
50
20 4 10
points points 0
0 3 FIFO memory Complex MUL + ROM Butterfly Equalizer Normalizer
40 512 1024 2048 4096 512 1024 2048 4096
Memory requirements Total chip size Resource allocation

Amplitude
SNR (dB)
0 30
20
Conclusion
Constant data path
Variable data path
An FFT processor using a novel data scaling approach that can operate on a wide range
of input signals, keeping the SNR at a constant level, has been presented. Compared to
10 Convergent block floating point
Our approach without VDL
Our approach with VDL
-1 0
1 1/2 1/4 1/8 1/16 1/32 1/64
block scaling approaches, such as CBFP, the proposed design requires significantly
smaller chip area due to the reduced memory requirements. At the same time it is
input amplitude
Arbitrary input signals with different amplitudes SNR for different amplitudes
capable of producing a higher SNR since scaling is applied in all pipeline stages.

Lenart

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lenart

Uploaded by

Copyright:

Available Formats

A 2048 complex point FFT processor for

acceleration in Digital Holographic Imaging

Digital Holographic Imaging FFT Implementation

Object Algorithm + Compact design + High throughput

The achieve a high throughput, a pipelined architecture is selected. The implementation

length for every stage in the pipelined 5 5

The problem is the large memory requirements, caused

FFT (variable datapath), i.e. the word

the input (a).

Data scaling - Hybrid floating point Delay feedback implementation

Presented approach: 1.6

exponent exponent Memory bits

Precision Memory and area requirements

Memory requirements Total chip size Resource allocation

You might also like