
The DSP Primer 11

Transposed FIR
with Multiplier Block


August 2005, University of Strathclyde, Scotland, UK For Academic Use Only


Transposed FIR with Multiplier Block 11.1

• Finite Impulse Response (FIR) digital filter structures shown are mathematically identical

• Transposed FIR can be formally derived from Standard FIR via cut-set retiming

• Transposed FIR with Multiplier Block is the most area efficient full-parallel FIR structure for FPGAs when implementing in the main logic fabric

• Multiplier block employs only add/sub/shift operators which map well to FPGAs

[Figure: three filter structures, labelled Standard FIR, Transposed FIR, and Transposed FIR with Multiplier Block]

August 2005, For Academic Use Only, All Rights Reserved


Notes:

The Transposed FIR structure has several advantages over the Standard FIR structure:

• input samples are fed simultaneously to the multiplication stage so no input shift register is required;

• issues with non-symmetric addition trees are removed - this also simplifies the design process;

• progressive addition chain data flow eases FPGA layout considerably;

• filter latency is reduced;

• identical tap coefficient magnitudes can share multiplication hardware.

The Transposed FIR structure with Multiplier Block is the most area efficient full-parallel FIR structure when
implementing in the main FPGA logic fabric. It is perfectly possible to implement Transposed structures using
the embedded multipliers provided by Xilinx or Altera although these may not always be available/suitable.

The most common architecture for logic fabric implementation of FIR filters is Distributed Arithmetic (DA). This
technique is employed by the filter generation cores provided by both Xilinx and Altera.
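
The claim that the Standard and Transposed structures are mathematically identical can be checked numerically. The following is a minimal Python sketch rather than anything taken from the course material: the function names `fir_standard` and `fir_transposed` and the example coefficients are assumptions made for illustration. The transposed version feeds each input sample to every tap at once and accumulates through a registered adder chain, so it needs no input shift register.

```python
def fir_standard(h, x):
    """Standard FIR: input shift register followed by a sum of products."""
    delay = [0] * len(h)                      # input shift register
    y = []
    for sample in x:
        delay = [sample] + delay[:-1]         # shift the new sample in
        y.append(sum(c * d for c, d in zip(h, delay)))
    return y


def fir_transposed(h, x):
    """Transposed FIR: every tap sees the current sample; the registers
    sit between the adders of the summation chain."""
    n = len(h)
    regs = [0] * (n - 1)                      # regs[i] feeds the adder of tap i
    y = []
    for sample in x:
        products = [c * sample for c in h]    # multiplication stage
        y.append(products[0] + (regs[0] if regs else 0))
        # Each register captures its tap product plus the next register.
        regs = [products[i + 1] + (regs[i + 1] if i + 1 < n - 1 else 0)
                for i in range(n - 1)]
    return y


h = [37, 21, 13, 3, -6]                       # illustrative tap values
x = [1, -2, 3, 0, 5, -7, 2]
print(fir_standard(h, x) == fir_transposed(h, x))   # True
```

Both functions start from zero state, so the outputs agree sample for sample from the first input onwards.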
Inside the Multiplier Block (I) 11.2

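[Figure: multiplier block graph generating the products 3x, 13x, 21x and 37x of the input using left shifts and adders/subtractors]

Legend:
• Adder/Subtractor labelled 21: output equals graph input multiplied by 21
• << p: shift signal left by p bits (multiply signal by 2^p)
• "-" on an input: signal is subtracted from the other entering the block
• Number on a signal (e.g. 4): signal is 4 bits wide
• Crossing signals are pipelined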



Notes:
A multiplier block may contain several pipeline stages with products being generated at each stage. Products
generated in the earlier pipeline stages may or may not be required at the final stage to be fed to the summation
chain. If products generated in the early pipeline stages are required at the output (such as 3 above), they must
be registered at each pipeline stage to keep data aligned. This costs registers and hence FPGA area and so
products are generated as “late” as possible, i.e. in the pipeline stage immediately before they are required.

Only +ve numbers are generated because -ve coefficient weightings can be restored by subtracting the +ve
equivalent at the summation chain instead of adding. Also, only +ve odd numbers are generated because any
even number product can be created by left-shifting the “base” odd number en-route to the summation chain.

To illustrate these two hardware saving points, if the multiplier block above is for a filter with tap values of 3 and
-6, the 3x product would be used twice as shown below. For -6, the 3x product would be left-shifted once en-
route to the summation chain and subtracted to restore the -ve weighting.
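[Figure: input X feeds the Multiplier Block, whose outputs 3, 13, 21 and 37 feed a clocked (Clk) summation chain producing Y. The 3x product is used twice: once directly and once as -6 (3 << 1).]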
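
The reduction of each tap to a +ve odd product can be written down in a few lines. This is an illustrative Python sketch only; the helper name `odd_fundamental` is an assumption and is not part of OpFiltGen. It splits an integer tap into a sign, a +ve odd base and a left shift, so that taps sharing a base can share one multiplier block output.

```python
def odd_fundamental(coeff):
    """Return (sign, odd_base, shift) with coeff == sign * (odd_base << shift)."""
    if coeff == 0:
        return (0, 0, 0)                 # a zero tap needs no product at all
    sign = -1 if coeff < 0 else 1
    mag = abs(coeff)
    shift = 0
    while mag % 2 == 0:                  # strip factors of two
        mag //= 2
        shift += 1
    return (sign, mag, shift)


taps = [3, -6]                           # the example tap values from the notes
print([odd_fundamental(c) for c in taps])              # [(1, 3, 0), (-1, 3, 1)]
print(sorted({odd_fundamental(c)[1] for c in taps}))   # [3]
```

For the two example taps only the single product 3x is needed: the -6 tap reuses it with a left shift of one and a subtraction at the summation chain.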

To maximise hardware savings, generated products should be reused as much as possible within the multiplier
block (e.g. the 3 product in the multiplier block feeds the 21 and 13 adders). However, hardware cost reduction
must be balanced against the issues of fanout and the associated routing/timing problems this may cause.
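
A shift-and-add multiplier block of this kind can be modelled in software as follows. This is a hedged Python sketch: the function name `multiplier_block` is an assumption, and the decomposition is one possible graph, not necessarily the one drawn on the slide. The 13x and 21x products are built from 3x as the notes describe; building 37x from 21x is an assumption made here for illustration.

```python
def multiplier_block(x):
    """Generate 3x, 13x, 21x and 37x using only shifts, adds and subtracts."""
    x3 = (x << 1) + x        # 3x  = 2x + x
    x13 = (x3 << 2) + x      # 13x = 12x + x    (reuses 3x)
    x21 = (x3 << 3) - x3     # 21x = 24x - 3x   (reuses 3x, subtractor)
    x37 = x21 + (x << 4)     # 37x = 21x + 16x  (assumed decomposition)
    return {3: x3, 13: x13, 21: x21, 37: x37}


# Check every product against a true multiplication for a 12-bit signed input.
ok = all(value == k * x
         for x in range(-2048, 2048)
         for k, value in multiplier_block(x).items())
print(ok)   # True
```

Four adder/subtractor operations produce four products here, and the 3x node has a fanout of two, which is the reuse versus fanout trade-off mentioned above.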
Inside the Multiplier Block (II) 11.3

• By cascading adders, subtractors and shift operations, any integer multiplication product of the input sample can be generated;

• Synthesising multiplier blocks is a highly time consuming process and for a reasonably sized filter, exploring every possible graph would take millennia!

• Heuristics must be used based on knowledge of the problem and the target architecture to find a mathematically correct solution in a reasonable time...

• Even with heuristics, block synthesis is too time consuming and error prone a process for humans - software and computers must be used;

• For FPGAs, the goal is to minimise the amount of hardware consumed when implementing the filter while respecting timing requirements.



Notes:

Synthesising multiplier blocks has been the subject of considerable research in recent decades and has been
shown to be an NP-complete problem.

It is possible for a human to work out a multiplier block by hand but it is a very time consuming and error prone
process. Also, it is very easy to miss a more optimal solution.

Using software to perform the multiplier block design task and filter generation removes the delays and errors
introduced by the human factor.
OpFiltGen HDL Filter Generation Software 11.4
[Figure: design flow. Filter Coefficients from DSP Analysis feed OpFiltGen and OpFiltView, followed by Simulation (Test data, Verify results), Synthesis, Implementation and Bitstream.]

• OpFiltGen generates VHDL for numerous filter types (e.g. singlerate, polyphase interpolation/decimation);

• OpFiltView generates a schematic representation;

• Generated VHDL is fully-pipelined, highly area efficient, simulatable and synthesisable;

• Other pieces of design software can automate OpFiltGen filter generation as part of a larger design and perform simulation, synthesis and implementation of an entire design.

OpFiltGen: Optimised Filter Generator



Notes:

OpFiltGen takes the following inputs:

• integer filter coefficients quantised to the required number of bits;

• filter type (e.g. singlerate, halfband, hilbert, interpolation, decimation);

• input bit-width;

• signed/unsigned input;

• number of distinct data channels to process (useful for I/Q processing for example);

• rate change factor (for interpolation/decimation filters).

Generated VHDL is generic and can be targeted at any device by any manufacturer via synthesis.

Most filters are generated in a few seconds - considerably faster than a human trying to work out the entire
multiplier block and then code optimal VHDL!

OpFiltGen works out the minimum required bit-width of every signal precisely to ensure maximum hardware
efficiency while retaining full mathematical precision.
OpFiltView Schematic (top level) 11.5
[Figure: OpFiltView top-level schematic. Input X feeds the Multiplier Block, whose outputs feed a clocked (Clk) summation chain producing Y. Labels visible in the schematic include the odd products 1, 7, 9, 17, 19, 21, 25, 29, 31, 33, 37, 43, 79, 261, 455 and 617, and shifted or negated taps such as 74 (37 << 1), -76 (19 << 2), -168 (21 << 3), -172 (43 << 2), -28 (7 << 2), 910 (455 << 1) and 1024 (1 << 10).]

DESIGN DESCRIPTION
Filter Type: Single Rate
Number of Taps: 31
Number of Channels: 1
Bit Widths: X: 12; Y: 25
Rate Change Factor: N/A
System Inputs: Clk
System Outputs: N/A



Notes:

This is the top-level view of a filter generated by OpFiltGen. Remember that it is the generated VHDL that actually
implements the filter - the schematic is for viewing only.

It is clear to see that the number of multiplier block outputs is far fewer than the number of taps (31). This is due
to the sharing and reuse afforded by only generating +ve odd numbers.

Within the multiplier block and summation chain, the hardware is fully pipelined to achieve the maximum clock
rates required for full-parallel implementations.
Inside the Summation Chain 11.6

• Once the required multiplication products of the input have been generated by the multiplier block, the summation chain completes the filter structure;

• The summation chain is fully-pipelined for full-parallel sampling rates;

• Adder output widths are only as wide as they need to be, meaning the adder widths increase towards the filter output as the potential output sample value grows.

[Figure: summation chain starting from 0 with tap weights 37, 21, 13, 3 and -6 (the 3 product shifted left by 1 and subtracted); the adder output widths along the chain are 18, 18, 19, 19 and 19 bits.]



Notes:

Multiplier Block outputs may be used at any point in the summation chain as required. Sharing multiplier block
outputs between multiple taps is hardware efficient although care must be taken with fanout issues.

Each tap adder must be wide enough to accommodate the worst-case sample value that may occur (e.g. -2048
in all -ve taps and 2047 in all +ve taps for a 12-bit input will give the worst case sample value). Due to the
coefficients being fixed, the worst case sample value can be calculated at filter generation time along with the
worst case value at every stage of the summation chain allowing precise bit-widths to be specified. Hence, the
adder bit-width will increase from the first adder in the chain until the last which will be the output width of the
filter.
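
The worst-case calculation can be sketched in Python as follows. The helper names `signed_width` and `chain_widths` are assumptions for illustration, not OpFiltGen functions. The tap weights 37, 21, 13, 3 and -6 are those of the example summation chain; the 12-bit signed input is an assumption taken from the example in the paragraph above.

```python
def signed_width(lo, hi):
    """Smallest two's complement width that can hold every value in [lo, hi]."""
    n = 1
    while lo < -(1 << (n - 1)) or hi > (1 << (n - 1)) - 1:
        n += 1
    return n


def chain_widths(coeffs, in_bits):
    """Adder output width after each tap of the summation chain."""
    x_min = -(1 << (in_bits - 1))        # e.g. -2048 for a 12-bit input
    x_max = (1 << (in_bits - 1)) - 1     # e.g.  2047 for a 12-bit input
    lo = hi = 0
    widths = []
    for c in coeffs:
        if c >= 0:                       # +ve tap: extremes follow the input
            lo += c * x_min
            hi += c * x_max
        else:                            # -ve tap: extremes swap over
            lo += c * x_max
            hi += c * x_min
        widths.append(signed_width(lo, hi))
    return widths


print(chain_widths([37, 21, 13, 3, -6], 12))   # [18, 18, 19, 19, 19]
```

Under these assumptions the computed widths grow from 18 to 19 bits along the chain, matching the widths shown on the summation chain slide.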
OpFiltView Schematic (multiplier block) 11.7

• Several of the adders feed others in subsequent pipeline stages;

• The block requires three pipeline stages to generate all of the required products;

• Note that the input itself (1x) is required at the output (for power of two taps) and is pipelined three times before emerging.

[Figure: OpFiltView multiplier block schematic. From the 12-bit input (1x), shifts and adders/subtractors generate the products 7x, 9x, 17x, 19x, 21x, 25x, 29x, 31x, 33x, 37x, 43x, 79x, 261x, 455x and 617x, with the bit-width marked on each signal.]


Notes:
The multiplier block above implements the multiplications required for a 31-tap low-pass filter. The block is
reasonably complex even for a relatively short filter. The block below performs the multiplications for a 121-tap
filter (four times the coefficients). The block is more complex than the 31-tap block but because more product
sharing is available, the graph is not 4 times bigger.
Area Comp. with Xilinx Core (2-10 bits) 11.8

[Figure: FPGA hardware area. Average FPGA area (slices, 0 to 7000) against filter length (0 to 250), with one curve per coefficient bit-width (2, 4, 6, 8 and 10) for each of the two legend groups, DA and RSG.]



Notes:

In this graph, filter length varies from 15 to 210 and coefficient bit-width from 2 to 10. Filters had 10-bit inputs
and each data point is averaged over 10 unique coefficient sets. All OpFiltGen filters require considerably less
area than the Coregen equivalents. The 10-bit OpFiltGen filters are comparable in size to the 2-bit Coregen
filters!

As filter length increases, the area advantages of the Transposed FIR with Multiplier Block architecture become
more apparent.
Area Comp. with Xilinx Core (12-20 bits) 11.9

[Figure: FPGA hardware area. Average FPGA area (slices, 0 to 10000) against filter length (0 to 250), with one curve per coefficient bit-width (12, 14, 16, 18 and 20) for each of the two legend groups, DA and RSG.]



Notes:

This graph illustrates that the OpFiltGen filters require significantly less area in general than the Coregen
equivalent filters for all coefficient bit-widths except for 20. For 20-bit coefficients, the results are comparable.
The results also demonstrate that, as coefficient bit-width and filter length increase, the area jumps between
plots for OpFiltGen are higher than those of the Coregen filters. This is because distributed arithmetic scales
more linearly than the multiplier block technique.
Conclusions 11.10

• Transposed FIR filter with Multiplier Block is the most efficient implementation method for fixed coefficient, full-parallel FIR filters in the main logic fabric on FPGAs;

• Multiplier Block synthesis is a complex problem that must be automated via software for optimised results in a reasonable time;

• Automatically synthesising the Multiplier Block and generating the associated VHDL to implement the entire filter saves engineers countless hours of laborious implementation;

• Comparisons with available filter generation software show this technique to be more area efficient.

Software Radio Filter Case Study 11.11

• As ADC and FPGA sample processing rates continue to move towards radio frequency (RF) rates, Software Radio flexibility increases due to the associated bandwidth increase

• FPGAs provide an ideal platform for software radio receivers:

  • Infinite reconfigurability allows different air-interfaces

  • Provision of scalable, multi-channel and parallel/serial processing as required to suit specifications

• For Software Radio receivers, FPGAs are used to push the radio sampling rate closer to the antenna by operating full-parallel processing architectures and taking advantage of on-chip dedicated high-speed data I/O hardware

• Full-parallel FIR filters can be used with Digital Down-Converter (DDC) implementations for Software Radio receivers



Notes:

This case-study discusses full-parallel FIR filters for FPGAs using their application in Digital Down-Converters
(DDCs) for Software Radio receivers as a background. A commercially available 4-channel, 40MHz DDC
architecture implemented on a Xilinx Virtex-II FPGA is described as a basis and a 2-channel, 80MHz DDC
system to be implemented on the same device is proposed. Two implementations of the required low-pass
decimation filters are generated and compared using OpFiltGen and the Xilinx Distributed Arithmetic core. The
proposed 2-channel, 80MHz DDC is bit-true modelled in SystemView, implemented in VHDL and verified via
simulation. The proposed DDC using OpFiltGen filters is found to require less FPGA resource than the original
4-channel, 40MHz DDC and the proposed DDC using Xilinx filters.
Commercial 40MHz DDC 11.12

[Figure: single DDC channel. ADC samples (14-bit, 100MHz) are mixed with cos(2πf_t t) and sin(2πf_t t), where f_t = tuning frequency; each mixer output passes through a decimation by 2 low-pass filter to give the I and Q outputs (50MHz).]

Note: each filter cuts off at 20MHz and when the I/Q outputs are considered together, the bandwidth is 40MHz

• Implemented by ICS Ltd. on the Xilinx Virtex-II FPGA provided with the ICS-554 Software Radio receiver

• Low-pass filtering and downsampling are combined into a single decimation filter for efficient hardware implementation



Notes:

Software radio receivers require mixing, filtering and downsampling of received signals to allow data to be
processed at a suitable rate. Part of this process can be achieved in FPGAs using a Digital Down-Converter
(DDC). A generic single channel DDC is shown below:

[Figure: generic single channel DDC. ADC samples (B MHz) are mixed with cos(2πf_t t) and sin(2πf_t t), where f_t = tuning frequency; each mixer output passes through a low pass filter and a downsampler (N) to give the I and Q outputs (B/N MHz).]

As well as mixing the incoming real signal from the ADC to extract the complex signal, a DDC must filter the
complex signal to reject image components introduced by the mixing process and then downsample. For
maximum software radio flexibility, the ADC, mixer and filters should sample as quickly as possible. Hence, if
the DDC is implemented on an FPGA, full-parallel techniques can be used to reach the required sampling rates.

Note that the low-pass filtering and downsampling can be combined using a decimation filter. This is hardware
efficient because if downsampling is performed separately (post-filter), filter hardware is wasted calculating
samples that are subsequently thrown away in the downsampling process. Combining the filtering and
downsampling into a decimation filter allows this inefficiency to be eliminated.
Commercial 4-Channel, 40MHz DDC 11.13

[Figure: ADC 1 to ADC 4 each feed one of DDC 1 to DDC 4 (40 MHz bandwidth), each producing I and Q outputs, all on a Xilinx Virtex-II XC2V3000 FPGA.]

• Four 40MHz DDCs are combined to implement a 4-channel receiver (ICS-554) on the Xilinx FPGA (XC2V3000)

FPGA Resource Usage
Dedicated Multipliers    96/96 (100%)
Block SelectRAM          48/96 (50%)
Slices                   10242/14336 (72%)



Notes:

Four of the 40MHz DDCs shown in the previous slide are used to implement a 4-channel software radio
receiver as shown in this slide.

All of the dedicated multipliers are consumed in addition to half the BlockRAMs and 72% of the main logic
fabric.
Proposed 80MHz DDC 11.14

[Figure: single channel of the proposed DDC. ADC samples (12-bit, 200MHz) are mixed with cos(2πf_t t) and sin(2πf_t t), where f_t = tuning frequency; each mixer output passes through a decimation by 2 low-pass filter to give the I and Q outputs (100MHz).]

Note: each filter cuts off at 40MHz and when the I/Q outputs are considered together, the bandwidth is 80MHz

• Based on the ICS 4-Channel, 40MHz DDC, a 2-Channel, 80MHz DDC was designed and implemented on the same FPGA

• The decimation by 2 low-pass FIR filters must be implemented as full-parallel structures to meet the required sampling rates



Notes:

Using the specifications of the 4-channel, 40MHz receiver described previously as a basis, a 2-channel, 80MHz
receiver was designed and implemented. A single channel of the 80MHz receiver is shown in this slide. The
ADCs used in the 4-channel receiver (Analog Devices AD6645) provide 14-bit samples at rates up to 105MHz.
For the proposed 80MHz receiver, the ADCs must provide samples at 200MHz. This can be achieved using the
Analog Devices AD9430 which yields 12-bit samples up to 210MHz.

The decimation by 2 filters must be implemented using full-parallel techniques to meet the sampling rate
requirements of the DDC.
Polyphase Decimation by 2 FIR 11.15
[Figure: polyphase decimation by 2 filter. Input samples are demultiplexed to Polyphase 0 and Polyphase 1, whose outputs are added to give the output samples. Hardware clock rates: B MHz at the input, B/2 MHz at the output.]

• Input samples are demultiplexed to each polyphase in turn

• Polyphases are implemented as single-rate FIR filters

• Each phase need only operate at the output clock rate since new samples are only provided every 2 clock cycles

• The adder combines the phase outputs and yields the filter output at the decimated rate



Notes:

As stated previously, to avoid hardware inefficiency, low-pass filtering and downsampling can be combined into
a single piece of hardware that implements both, known as a polyphase decimation filter. For a decimation
factor of N, the low-pass filter impulse response is split into N polyphase banks that are each implemented as
a single-rate FIR filter. Each polyphase bank need only calculate output samples at the lower filter output
sampling rate. Hence, as long as the multiplexing hardware is implemented efficiently enough to allow very high
speed operation, the phase filters can be implemented using parallel hardware to allow very high sampling rates.
The outputs of all phases are added to yield each filter output sample at the decimated rate. This process is
illustrated in this slide for a decimation by 2 filter.
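
The equivalence between a polyphase decimation by 2 filter and "filter, then discard every second sample" can be checked with a short Python sketch. The function names and the example coefficients below are assumptions for illustration; this is a behavioural model, not the generated VHDL. Even-indexed coefficients form phase 0 and odd-indexed coefficients form phase 1, and each phase runs only at the output rate.

```python
def fir(h, x):
    """Single-rate FIR with zero initial state; one output per input."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def decimate_direct(h, x):
    """Filter at the input rate, then throw away every second output."""
    return fir(h, x)[::2]


def decimate_polyphase(h, x):
    """Decimation by 2 using two single-rate phase filters."""
    h0, h1 = h[0::2], h[1::2]            # split the impulse response
    x0 = x[0::2]                         # even input samples go to phase 0
    # Odd input samples go to phase 1, arriving one output period later.
    x1 = ([0] + x[1::2])[:len(x0)]
    y0 = fir(h0, x0)
    y1 = fir(h1, x1)
    return [a + b for a, b in zip(y0, y1)]   # output adder


h = [1, 2, 3, 4, 5]                      # illustrative coefficients
x = list(range(1, 12))
print(decimate_polyphase(h, x) == decimate_direct(h, x))   # True
```

The direct version computes every output at the input rate and discards half of them, whereas each phase filter in the polyphase version only ever computes outputs that are kept.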
Decimation by 2 FIR Implementation 11.16
• The decimation by 2 filters were specified as follows:

Parameter                         Specification
Low-pass cutoff                   40MHz
Low-pass stop-band attenuation    -100dB
Downsampling factor               2

• VHDL for the filters was automatically generated using OpFiltGen

• Generated VHDL was synthesised using Synplify Pro 7.3 and placed & routed using Xilinx ISE 5.2.03i

• The Xilinx Core Generator (Coregen) Distributed Arithmetic v8.0 was also used to implement the filter specification as a comparison:

FPGA Resource            OpFiltGen Filter Usage    Coregen Filter Usage
Dedicated Multipliers    0/96 (0%)                 0/96 (0%)
Block SelectRAM          0/96 (0%)                 0/96 (0%)
Slices                   1681/14336 (12%)          1997/14336 (14%)



Notes:
Since the proposed DDC requires a bandwidth of 80MHz, each filter must have a bandwidth of 40MHz.

Also, because the filter implementation techniques in use here primarily target the logic fabric, dedicated
resources such as the multipliers and BlockRAMs are not required and can be used for additional functionality.

As expected, the OpFiltGen generated filters require less hardware than the Coregen equivalents.
Decimation FIR - OpFiltView Schematic 11.17

• OpFiltGen Viewer produces a schematic view of generated VHDL

• Users can “drill” down from the top-level to view multiplier graphs



Notes:
This slide illustrates the top-level polyphase structure of the OpFiltGen generated filter. The schematic was
automatically generated using OpFiltView and the input multiplexer, phase filters and output adder can clearly
be seen.
2-Channel, 80MHz DDC Implementation 11.18

[Figure: DDC block diagram. ADC Samples (200MHz) feed an fs/4 mixer followed by the decimation by 2 filters. The ADC samples are formed from additive white Gaussian noise and two tones: 10MHz (0dB) and 70MHz (-12dB).]

• The 80MHz bandwidth DDC was modelled in SystemView using precise bit-widths and signed integer arithmetic to provide a bit-true comparison with the VHDL implementation

• VHDL was written to implement the 2-Channel, 80MHz receiver (code for mixer, filter integration)

• Tuning frequency of 50MHz allowed hardware efficient fs/4 mixer

• Design was simulated using Aldec Active-HDL 5.2 and verified against SystemView model



Notes:
In addition to the filters, a complex mixer must also be implemented to extract the in-phase and quadrature
components from the incoming ADC samples. In general, this is achieved using look-up tables (stored in
BlockRAM) with suitable values for the sine/cosine tuning frequency generation and hardware to multiply the
incoming signal and the sine/cosine tones together. An address counter (not shown) would be used to cycle
through the values stored in BlockRAM.

When the tuning frequency f_t is exactly a quarter of the system sampling frequency, a sine wave cycle is
represented by the sequence “0,1,0,-1,0” and a cosine cycle by “1,0,-1,0,1”. This greatly simplifies the mixing
process since the ADC samples do not require explicit multiplication. The “multiplier” either passes the ADC
sample as is (1), passes its complement (-1) or outputs 0. This process is known as “fs/4 mixing” and maps very
well to FPGAs. No actual multiplication hardware or BlockRAMs are required. For the 80MHz DDC, the system
sampling frequency is 200MHz, hence the fs/4 frequency is 50MHz. If f_t is set to 50MHz, an fs/4 mixer can be
implemented which requires very little hardware as shown in (b) below. The “complement” hardware effectively
multiplies the input by -1 and the multiplexers select the appropriate input depending on the counter value.

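[Figure: (a) Generic mixer: ADC samples are multiplied by cos(2πf_t t) and sin(2πf_t t) using dedicated multipliers, with the tone values held in Block SelectRAM, to give I and Q. (b) fs/4 mixer: a counter drives multiplexers that select the ADC sample, its complement or 0 to give I and Q.]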
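
The fs/4 mixing idea can be modelled with a counter and a selection, with no multiplication. The following Python sketch is illustrative only: the function names and sample values are assumptions, and it models the behaviour rather than the VHDL. It uses the cosine sequence 1, 0, -1, 0 for I and the sine sequence 0, 1, 0, -1 for Q, and compares the result with an explicit multiply by the tone values.

```python
import math


def fs4_mixer(samples):
    """Mix with cos/sin at fs/4 using only pass, negate or zero."""
    i_out, q_out = [], []
    for n, s in enumerate(samples):
        phase = n % 4                          # 2-bit counter
        i_out.append((s, 0, -s, 0)[phase])     # cos: 1, 0, -1, 0
        q_out.append((0, s, 0, -s)[phase])     # sin: 0, 1, 0, -1
    return i_out, q_out


def generic_mixer(samples):
    """Reference model: explicit multiplication by the tone values."""
    i_out = [s * round(math.cos(math.pi * n / 2)) for n, s in enumerate(samples)]
    q_out = [s * round(math.sin(math.pi * n / 2)) for n, s in enumerate(samples)]
    return i_out, q_out


adc = [5, -3, 7, 2, -8, 1, 4, -6]              # illustrative ADC samples
print(fs4_mixer(adc) == generic_mixer(adc))    # True
print(fs4_mixer(adc)[0])                       # [5, 0, -7, 0, -8, 0, -4, 0]
```

Half of the I samples and half of the Q samples are always zero, which is why the multiplexers and a negation are all the mixer hardware needs.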


Results/Conclusions 11.19
[Figure: DDC output spectra, with the following annotations.]

• VHDL implementation matches SystemView model precisely

• OpFiltGen/Coregen filters give identical mathematical results

• Input tones are correctly down-converted

Attribute                4-Channel, 40 MHz DDC    2-Channel, 80 MHz DDC    2-Channel, 80 MHz DDC
                         (ICS Commercial)         (OpFiltGen filter)       (Coregen filter)
Channels                 4                        2                        2
Channel bandwidth        40MHz                    80MHz                    80MHz
Dedicated Multipliers    96/96 (100%)             0/96 (0%)                0/96 (0%)
Block SelectRAM          48/96 (50%)              0/96 (0%)                0/96 (0%)
Slices                   10242/14336 (72%)        6930/14336 (48%)         8008/14336 (56%)

• 2-Channel, 80 MHz DDC with OpFiltGen filters requires least resource

• The air-interface flexibility requirement of Software Radio receivers is more than catered for using FPGA technology



Notes:
The 2-channel 80MHz DDC was implemented in VHDL, simulated using Aldec Active-HDL 5.2 and the output
compared with SystemView. The VHDL and SystemView implementations were identical as shown in this slide.
The spectra also demonstrate the 80MHz bandwidth of the DDC and show the example tones applied to the
DDC being correctly down-converted.

The VHDL for the 2-channel 80MHz DDC was implemented as two instances, one using the OpFiltGen
decimation filter and the other using the Coregen filter. Both instances yielded mathematically identical results
in terms of the DDC spectra but the instance using the OpFiltGen generated filters required less FPGA resource
as shown in the results table.

Case-study conclusion:

Using the original 4-channel 40MHz DDC by ICS as a basis, a 2-channel 80MHz DDC has been proposed,
designed and successfully implemented. As a comparison, filters generated by OpFiltGen and Coregen were
used to implement the DDC. The 80MHz DDC using OpFiltGen generated filters requires the least FPGA
resource of all three implementations. Also, the new design does not require any dedicated multipliers or
BlockRAMs, meaning additional functionality could be implemented on the FPGA using these resources. To
generalise the 80MHz DDC for any tuning frequency (other than 50MHz), 4 multipliers and 2 BlockRAMs could
be used instead of the fs/4 mixers. The flexibility of FPGAs has been utilised to double the
bandwidth of the 40MHz DDC with half the channels and less FPGA resource. Hence, the air-interface flexibility
requirement of Software Radio receivers is more than catered for using FPGA technology.
