You are on page 1of 11

A Guide to Using Field Programmable Gate Arrays (FPGAs) for Application-Specific Digital Signal Processing Performance

Gregory Ray Goslin Digital Signal Processing Program Manager Xilinx, Inc. 2100 Logic Dr. San Jose, CA 95124 Abstract:
FPGAs have become a competitive alternative for high performance DSP applications, previously dominated by general purpose DSP and ASIC devices. This paper describes the benefits of using an FPGA as a DSP Co-processor, as well as, a stand-alone DSP Engine. Two Case Studies, a Viterbi Decoder and a 16-Tap FIR Filter are used to illustrate how the FPGA can radically accelerate system performance and reduce component count in a DSP application. Finally, different implementation techniques for reducing hardware requirements and increasing performance are described in detail.

Introduction:
Traditionally, digital signal processing (DSP) algorithms are implemented using general-purpose (programmable) DSP chips for low-rate applications, or special-purpose (fixed function) DSP chip-sets and application-specific integrated circuits (ASICs) for higher rates. Advancements in Field Programmable Gate Arrays (FPGAs) provide new options for DSP design engineers. The FPGA maintains the advantages of custom functionality like an ASIC while avoiding the high development costs and the inability to make design modifications after production. The FPGA also adds design flexibility and adaptability with optimal device utilization while conserving both board space and system power, which is often not the case with DSP chips. When a design demands the use of a DSP, or time-to-market is critical, or design adaptability is crucial, then the FPGA may offer a better solution. The SRAM-based FPGA is well suited for arithmetic, including Multiply & Accumulate (MAC) intensive DSP functions. A wide range of arithmetic functions (such as Fast Fourier Transform’s (FFT’s), convolutions, and other filtering algorithms) can be integrated with surrounding peripheral circuitry. The FPGA can also be reconfigured on-the-fly to perform one of many systemlevel functions. When building a DSP system in an FPGA, the design can take advantage of parallel structures and arithmetic algorithms to minimize resources and exceed the performance of single or multiple general-purpose DSP devices. Distributed Arithmetic[1] for array multiplication in an FPGA is one way to increase data bandwidth and throughput by several order of

magnitudes over off-the-shelf DSP solutions. One example is a 16-Tap, 8-Bit Fixed Point, Finite Impulse Response (FIR) filter[2]. The FIR design supports more than 8 million samples per second. This example can also be implemented using multiple bits, until a “Fully Parallel Distributed Arithmetic” algorithm is obtained for higher sample rates (i.e., 55.89 million samples per second). Figure 1 compares the 16-Tap FIR filter implemented in a state-of-the-art fixed-point DSP with that of the Xilinx FPGA. As published by Forward Concepts[5], “For 1995, the state-of-the-art fixed-point DSP is rated at 50 MIPS.” Such a device requires 20 nsec per Tap to implement a 16-Tap FIR filter, which translates to a theoretical maximum (with zero wait-states) sample rate of 3.125 million samples per second. An In-System Programmable (ISP) FPGA can also be reconfigured on the board during system operation. Taking advantage of the reconfigurability feature means a minimal chip solution can be transformed to perform multiple functions. For example, an FPGA could be the basis for a system that performs one of several DSP functions. Suppose, for instance, one function is to compress a data stream in transmit mode and another function is to decompress the data in receive mode. The FPGA can be reconfigured on-the-fly to switch, or toggle, from one function to another. This capability of the FPGA adds functionality and processing power to a minimum-chip DSP system controlled with an internal or an external controller. This “Reconfigurable Computing” technique is beginning to impact design methodologies.

V.1.0

© 1995 XILINX, Inc. All rights reserved.

the general-purpose DSP devices are used to drive the technology forward. FIR Filter. Comparators and Correlators just to mention a few. The FPGA is well suited for many DSP algorithms and functional routines. such as. one can enhance the overall performance of the algorithm. The multi-path parallel data structures can be processed either through parallel DSP devices or in a single FPGA-based DSP hardware accelerator with or without the assistance of a DSP device. This process is ⇒ adequate for independent sequential algorithms (a modified Von Neumann Architecture). ⇒ ⇒ and 3⇒memory-write instructions. These operational data paths can consist of any combination of simple and complex functions. FPGA-Based DSP Accelerator: General-purpose DSP devices have been the enabling technology for most mid to high-end electronic systems. 2 . 8-Bit Fixed Point. The FPGA can be programmed to perform any number of parallel paths. Basic DSP Block Diagram System performance requirements continue to increase. more complex DSP systems in less time than larger design groups with more experienced engineers who are required to know devicespecific programming languages. and ready a complex DSP system for production in weeks. Adders. The FPGA can also be partially or completely reconfigured in the system for a modified or completely different algorithm. Typically. Multiply and Accumulation. Distributed Arithmetic Performance for 16-Tap. can design larger. Smaller design groups. test. (Note that the Xilinx XC6200[6] family can be partially reconfigured while the rest of the device is still active in the system. The DSP V. the overall process is performed in a three-step series of 1⇒memory-read.A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance By Gregory Ray Goslin Data Sample Rate (MHz) 70 65 60 55 50 45 40 35 30 25 20 15 10 5 0 DSP Region FPGA Performance Characteristics for Distributed Arithmetic processor becomes less efficient when an algorithm is dependent on two or more of the past. Inc. A typical DSP algorithm contains many feed-back loops or parallel structures [see Figure 3]. FPGA Region Full Parallel DA @70-50MHz Memory-A Memory-B I/O 1-Bit Serial Distributed Arithmetic w/ XC4000E-3 @CLK=66MHz 4-Bit Parallel DA @35MHz Control Logic 2-Bit Parallel DA @16MHz 8-Bit Serial DA @8MHz Dedicated Arithmatic Processing Unit DSP Device Figure 2. The advancement in performance for DSP devices is lagging behind this growth.0 © 1995 XILINX. This is primarily due to the feedback or parallel structure of the data flow being processed sequential with additional wait-states in a DSP. Analyzing the DSP algorithm and breaking out any parallel structures or repetitive loops into multiple data paths. The FPGA-based DSP system-level design team can design. All rights reserved.1. While the DSP processor may perform multiple instructions per clock cycle (Harvard Architecture). present. The software code for a DSP algorithm of this type is not efficiently implemented in a general purpose DSP. with less experienced engineers. Counters. about 10-30% of the DSP code utilizes 60-80% of the processors power. attempted in real-time [see Figure 2]. Barrel Shifters. Because of this imbalance the DSP designer is forced to use a systolic array of DSP processors to boost the overall performance.) The primary concept is to unload the compute-intensive functions requiring multiple DSP 1 8 16 24 32 40 48 # of Multiply & Accumulates per Sample Figure 1. verify. The FPGA design cycle requires less hardware-specific knowledge than most DSP chips or ASIC design solutions. 2⇒process. Although high-volume applications favor ASIC implementations. The engine in the DSP system is typically a specialized time-multiplexed sequential computer system that performs a continuous mathematical process. and/or future state conditions.

Initially. The algorithm used to calculate the “New_1” and “New_2” outputs. some DSPs can Multiply and Accumulate (MAC) in one clock cycle (DSP @66 MHz = 15 nsec per MAC) compared to eleven clock cycles (P5@100 MHz = 110 nsec per MAC) in the Pentium™ processor. First.F) = G Send . Secondly. Some DSPs have been optimized to do some dedicated functions much faster than a MPU (Von Neumann architecture).E f(D. Note this did not include the calculation of the “Diff_1” and “Diff_2” outputs.E) = F Fetch .B) = C Send .D Fetch . Inc. the Viterbi Decoder algorithm required 360 nsec [(24-clock cycles)•(15 nsec)] of processing time. Old_1 + + + M U X MSB Fetch .C g(C. The FPGA/DSP implementation is more flexible and proves to be more cost effective than multiple DSPs or an ASIC. All rights reserved.A Fetch . each Add/Subtract and MUX stage had to be performed sequentially with additional wait-states. For example. The ability to process parallel data paths within the FPGA takes advantage of the parallel structures of the four Add/Subtract-blocks in the first stage and the two Subtract-blocks in the second stage. Data_In Begin about 80% of the overall processing time. Old_1 + + R E G Case Study Viterbi Decoder A Viterbi Decoder[7] is a good example of how the FPGA can accelerate a function [Refer to Figure 4]. Basic DSP Algorithm Process Figure 4. Viterbi Decoder Block Diagram The FPGA-based DSP accelerator is conceptually similar to a Coprocessor for the Microprocessor Units (MPUs) in the computer market.G New_1 Diff_1 I/O Bus I/O Bus INC + + Diff_2 MSB Adaptive Changes - + Data_Out Old_2 + + - M U X New_2 Prestate Buffer 24-bit 24-bit 1 0 Bit 24-bit Figure 3. The Add/Subtract stages required four additional operations with multiple instructions. shown in Figure 4. the MPU required a Math-Coprocessor to accelerate computational algorithms. The initial design used two 66 MHz programmable DSP devices to implement the algorithm. These two outputs would have required an additional seven clock cycles. required 17-computational clock cycles.B f(A. The Viterbi Decoder algorithm utilized + - R E G I/O Bus M U X R E G New_1 MSB INC + + R E G Diff_1 I/O Bus + + R E G Diff_2 MSB + R E G R E G M U X R E G New_2 Old_2 Prestate Buffer Optional Pipelining Register's 24-bit 24-bit 1 0 Bit 24-bit Figure 5. Diagram FPGA-Based Viterbi Decoder Block The design conversion resulted in the following performance: The FPGA based Viterbi Decoder required 135 nsec [(9-clock cycles)•(15 nsec)] of total V. The combined functionality of the FPGA and the general-purpose DSP can support several magnitudes higher data throughput than two or more parallel DSP devices. 3 . This forced the I/O Bus speed to a maximum of 30 nsec. The two MUXblocks take advantage of the ability to register and hold the input data until needed with no external memory or additional clock cycles. Hence. the wait state associated with the DSP’s external SRAM memory required two 15 nsec clock cycles for each memory access. the data Bus transfer required 30 nsec for each data transaction. FPGA & DSP Based Viterbi Decoder: The Viterbi Decoder is well suited for the FPGA. plus an additional 7-clock cycles due to the wait-state associated with the DSP’s external SRAM memory. The functionality of the Coprocessor became so frequently used that it is now an integrated part of the latest processor families.A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance By Gregory Ray Goslin clock cycles into the FPGA and allow the DSP processor to concentrate on optimized single-clock functions.0 © 1995 XILINX.1.C Fetch . Hence. There were two limiting factors for this DSP-based design. The seven (2’sComplement) I/O Data words were multiplexed on a common I/O Bus at the maximum I/O rate of 33 MHz. This memory was required by the programmable DSP for data storage.

from one to thousands. the algorithm becomes much more complex for a CPU-based architecture. To use the FPGA in a DSP design. As the number of MACs increase. The SDA-FIR design supports an on-off chip I/O-data rate at more than 8 million samples per second [13. the use of Distributed Arithmetic for array multiplication in an FPGA is one technique used to implement and increase the function’s data bandwidth and throughput by several order of magnitudes over off-the-shelf DSP solutions. This multiplexed implementation minimizes the CLB resources used.18 0 P-5 Hardware Device DSP 1. 20 18 16 Relative Performance (@50MHz) 14 12 10 8 6 4 2 0. 8-Bit FIR filter compared to a 50 MHz fixed-point DSP processor This example can also be implemented using multiple bits.7 times faster This design can also be implemented in a multiplexed version with the original 33 MHz I/O Data transfer rate. Filter designs can vary over a wide range in the number of MACs.00 4. supporting twice the original throughput. One example is a “Serial Distributed Arithmetic” (SDA).12 nsec per Tap ] and occupies 400 CLBs in an XC4000E-3. the algorithm becomes more compute-intensive for any conventional DSP.5% better processing performance. which makes it ideal for implementing DSP functions in LUT-based FPGAs.A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance By Gregory Ray Goslin processing time (this includes all of the outputs) compared to the 360 nsec required for the partial output data by the DSP. This is one example of how the FPGA can accelerate a DSP function.71 nsec per Tap] while occupying 68 Configurable Logic Blocks (CLBs)[3] in an XC4000E-3 FPGA. The MAC function can be implemented more efficiently with various Distributed Arithmetic (DA)[1] techniques then with conventional arithmetic methods. The I/O Data Bus can also support a 66 MHz data transfer rate. 17. All rights reserved. Note that “Old_1” and “Old_2” Datum follow the same path with only a minor difference in the magnitude of the second stage SUBblock.00 2.1. Distributed Arithmetic can make extensive use of look-up tables (LUTs). This is done by observing the symmetrical nature of the design. FPGA-Based. Processing Time (ns) 400 300 200 100 0 Two 66 MHz DSPs Six 15 ns RAMs 66 MHz DSP+FPGA Three 15 ns RAMs FPGA Based Filters: When building a Digital Filter in an FPGA. While the FPGA has no dedicated multiplier. Relative performance for various implementations of a 16-Tap. Inc. up to a full “Parallel Distributed Arithmetic” (PDA) algorithm to obtain higher I/O-data sample rates which can exceed 55 million samples per second [ 17.88 16-Tap.0 © 1995 XILINX. identify the parallel data paths and/or the operations requiring multiple clock cycles in the DSP.60 FPGA 360 ns 62 % Better Performance with FPGA Co-processor 135 ns Figure 6. 16-Tap 8-Bit Finite Impulse Response (FIR) filter[2].9 nsec / 16-Taps = 1. Performance of two Viterbi decoder implementations.7 nsec x (8-Bits + 1) / 16-Taps = 7. 4 PDA FPGA . This enhancement equates to 37. The DSP+FPGA solution is 2. the design can take advantage of parallel structures and Distributed Arithmetic algorithms to exceed the performance of multiple general-purpose DSP devices.5% of the original DSP processing time or 62. Hence. Note that increasing the number of Multiply and Accumulation blocks or Taps has no significant impact on the sample rate when Serial Distributed Arithmetic is used [see Figure 1]. This implementation requires a specific ordering for data accesses on the I/O Data Bus. 8-Bit FIR Filter Relative Performance Digital Filter Design: The processing engine of most filter algorithms is a Multiply and Accumulate (MAC) function. DSPx4 SDA Figure 7. Note that these two examples will work V. The FPGA-based implementation replaced a programmable DSP and three SRAM chips.

Take for example a four-product MAC function that uses a conventional sequential shift-and-add technique to multiply four pairs of numbers and sum the results [see Figure 10]. contains only feed-forward terms. Figure 8. which uses more CLBs. “Parallel” Distributed Arithmetic techniques are used to achieve the fastest sample rates. Since the FIR response. 50 45 40 35 30 25 20 15 10 5 0 1 4 8 12 16 20 24 28 32 36 40 44 48 Programmable DSP Performance # of Multiply & Accumulates per Sample DSP Region Case Study 16-Tap. Data Rate (MHz). is of no added value. mathematically. REG Data Input X[7:0] REG REG REG REG REG REG REG PSC 0 15 REG 1 14 REG 2 13 REG 3 12 REG 4 11 REG 5 10 REG 6 9 REG 7 8 C0 C1 C2 C3 C4 C5 C6 C7 DSP Processors 2 REG -1 Data Output Y[9:0] # of DSP Cores 4x-DSP MCM 3x-DSP MCM 2x-DSP MCM 1x-DSP Figure 9. consider the performance and cost advantages available using an FPGA. Optimizing the design for a specific configuration can often reduce the number of CLBs by as much as 20%.5 million samples per second with a 4xDSP multi-chip module. Compare the same 16-Tap FIR filter algorithm implemented in various state-of-the-art fixed-point DSPs with that of the Xilinx FPGA. which translates to a maximum sample rate of 3.1. A fixed-point DSP rated at 50 MIPS requires at least 20 nsec per Tap to implement a 16-Tap FIR filter. The design must work at the sample rate. 5 . while any additional performance. The performance of general-purpose DSP devices degrades significantly with each additional MAC (one additional Tap). 16-Tap FIR filter Data Flow Block Diagram with Symmetrical Coefficients FPGA Based 16-Tap FIR Filter: The FPGA has the ability to implement a FIR filter function using one of several Distributed Arithmetic techniques. FPGA Region Distributed Arithmetic: Distributed Arithmetic differs from conventional arithmetic only in the order in which it performs operations.125 million samples per second or 12. All rights reserved. If the system design requires a systolic array or uses more than one traditional DSP device. as shown in Figure 8.A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance By Gregory Ray Goslin with any 16-Tap configuration with signed or unsigned. as shown in Figure 8. the FIR is unconditionally stable and can be designed to exhibit a linear phase response. while lower rates can be sustained with a “Serial” or “Serial-Sequential” Distributed Arithmetic techniques that use less resources (fewer CLBs). 8-Bit FIR Filter: The 16-Tap FIR is a discrete-time filter in which the output is an explicit function only of the present and previous inputs to compute the weighted average of the 16-data sample points. depending on the performance required. fixed point 8-Bit data and coefficients. A design which runs below the sample rate is of no value. V.0 © 1995 XILINX. The primary design concern is the performance or the sample rate of the filter. Inc. Note that these techniques can be used to optimize the implementation of many other types of data processing or MAC-based algorithms.

that eventually are summed into the output. If any data bit is a zero. Conventional four-product MAC using shift-and-add technique to multiply pairs and sum the result Each multiplier forms partial products by multiplying the coefficient-Y by 1-bit of the data-A at a time in an AND operation. Hence. Replacing these AND functions and the three adders with a simple 4-bit (16-word) Look-up Table (LUT) gives the final reduced form of a bit Serial Distributed Arithmetic MAC [see Figure 13]. Four-Bit MAC using shift-and-add technique to multiply The four-multiplications are performed simultaneously and the results are then summed when the products are complete. then the corresponding coefficient is eliminated from the sum. The least significant bit (the output from each serial shift register) of the 4-data samples addresses the LUT. & Y3). V. rather than after. Y1. the LUT output contains all 16 possible sums of the coefficients [see Figure 14]. This process requires n-clock cycles for data sample of n-bits. This technique reduces the number of shift-and-add circuits to one. thus compensating for the bit-weighting of the ingressing partial product [see Figure 11]. the output of the AND functions and the three adders of Figure 12 depend only on the four input bits from the shift registers. If all four data bits are 1. to perform a divide by two (÷2) each clock cycle. Y2. Serial Distributed Arithmetic for a four-product MAC. & PD]. LUT-Based Serial Distributed Arithmetic for a four-product MAC. the four-multipliers simultaneously create four-product terms [i. During each data clock-cycle. 6 . Using Distributed Arithmetic.e. based on the four inputs.1. All rights reserved. A S-REG Figure 10. then the output from the LUT is the sum of all four coefficients. referenced in Figure 14. Distributed Arithmetic differs from this process by adding the partial-products before.A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance By Gregory Ray Goslin A Y0 SHIFT-REG A MAC Block + + MS LS REG Y0 Ai Y1 Bi Y2 Ci Y3 Di PA + PB + + PC + PD + + + + 1-Bit Scaling Accumulator MS LS REG S-REG Ai B P S-REG PA 1/2 C + S-REG B S-REG P Bi Ci Di Y1 Y2 PB MAC + + SUM D S-REG SUM 1/2 C PC MAC S-REG + + D S-REG Y3 PD MAC + Figure 12. the operations are reordered. in Figure 10. as shown in Figure-12. PA. PB. but does not change the number of simple adders. A*Y = P is. Because the address of the LUT contains all possible combinations of one or zero. This technique then adds these partial products into an accumulator that shifts the accumulator’s feedback 1-bit position. The LUT data. PC. the processing clock rate is equal to the data rate divided by the number of data bits. Inc. Consequently. to the right. the bit-weighted accumulation. is composed of all partial sums of the coefficients (Y0. The coefficients in many filtering applications are constants. { Start } A0 * ( Y 3 Y 2 Y 1 Y 0 )<1/2> { Cycle=0 } + A1 * ( Y 3 Y 2 Y 1 Y 0 ) <1/2> { Cycle=1 } + A 2 * ( Y 3 Y 2 Y1 Y0 ) <1/2> { Cycle=2 } + A 3 * ( Y3 Y2 Y1 Y0 ) { Cycle=3 } { End } P 7 P 6 P 5 P4 P3 P 2 P 1 P 0 C S-REG Ci P D S-REG SUM 1/2 Figure 11. A SHIFT-REG Y Ai MAC Block + + MS LS REG P Ai 1-Bit Scaling Accumulator + + Di MS LS REG P 1/2 B S-REG Bi LOOKUP TABLE LET A = A 3 A 2 A 1 A 0 and Y = Y 3 Y 2 Y 1 Y 0 then.0 © 1995 XILINX.. Figure 13.

A large LUT can be avoided by partitioning the circuit into smaller groups and combining the LUT outputs with adders. D9 S-REG + LUT P + SUM 1/2 DA S-REG DB S-REG + + LUT DC S-REG DD S-REG DE S-REG DF S-REG Figure 15. The output bit from the shift register addresses 1-bit of a corresponding LUT. Note that V. result in a growth factor of between 6 and 7. the width of the LUT is typically 2-bits greater than the coefficient width. The shift register then serially shifts out all the data-bits.0 © 1995 XILINX.A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance By Gregory Ray Goslin A S-REG D0 S-REG Look-Up Table Ai ADDR_0 LUT Inputs < D i Ci Bi <0 0 0 <0 0 0 <0 0 1 <0 0 1 <0 1 0 <0 1 0 <0 1 1 <0 1 1 <1 0 0 <1 0 0 <1 0 1 <1 0 1 <1 1 0 <1 1 0 <1 1 1 <1 1 1 Ai > 0> 1> 0> 1> 0> 1> 0> 1> 0> 1> 0> 1> 0> 1> 0> 1> LUT Output Partial Sum 0000. In general. LUT-Based Serial Distributed Arithmetic for a four-product MAC. Contents of a 16-Word Look-Up S-REG + + 1-Bit Scaling Accumulator MS LS REG D8 S-REG In the four-MAC example. LUT.. The adders are less costly than the larger LUT. Inc. In a SDA-based design with multiple four-input LUTs. 1-bit at a time. If the FPGA architecture uses four-input function generators. The MAC has a series of parallel-to-serial converters. 7 . the size of the LUT grows exponentially. four LUTs the same size as the one used in the example of Figure 13 are used. Each additional Tap would also use a serial-in and serial-out shift register with its input fed serially from the previous cascaded Tap.1. Only the first word is parallel loaded. each occupying approximately the same space as one 4-bit LUT. The circuit will require at most 2-bits because the circuit is summing no more than four coefficients. The 16-MAC product and sum can be partitioned as shown in Figure 15. The only difference between a MAC circuit and a FIR filter design is how the input data is handled [see Figure 16]. The output bit from the first shift register (the first filter Tap) is the input to a serial-in and serial-out shift register (the second filter Tap). Fewer bits may be adequate in cases in which the combinations of coefficients result in less word growth. the number of additional bits should be at least Log2(number of Coefficients). each are configured in the same manner as was discussed with the single four-input LUT. As the number of coefficients increase. Table. When the shifting is complete. The additional two-bits allows for word growth from the addition of the four coefficients. In the example of the 16-Tap FIR filter. The first shift register operates by parallel-loading the “new” data sample into a registerbased shift register. until the most significant bit of n-bits has been clocked out. The additional three adders. Rather than increase the size of the LUT by a growth factor of 16. while the FIR filter has one long bit-wide shift register that taps into each word. All rights reserved. More than four coefficients require more width for word growth. the circuit would require a 16-bit LUT. the optimum partition is four products per LUT. You can use fewer bits only when the specific coefficients are known and the LUT has been computed. the first data shift register is empty and is ready to be parallel loaded with the next word and the process repeats.0 YA YB Y B +Y A YC Y C +Y A Y C +Y B Y C +Y B +Y A YD Y D +Y A Y D +Y B Y D +Y B +Y A Y D +Y C Y D +Y C +Y A Y D +Y C +Y B Y D +Y C +Y B +Y A D1 S-REG 16-MAC CIRCUIT LUT B S-REG Bi D2 S-REG Using 4-bit LUT-Based Distributed Arithmetic ADDR_1 C S-REG Ci LOOKUP TABLE ADDR_2 D3 S-REG D S-REG Di ADDR_3 O U T P U T + + LUT D4 S-REG D5 S-REG D6 S-REG D7 Figure 14..

25-CLBs at 7. This is a result of needing access to only one bit of each data sample (Tap) per clock cycle. By adding pipeline registers the design can be clocked at a higher rate. Pipelined input/output timing diagram for a SDA 16-Tap FIR Filter. The data in the remaining RAM-based shift registers are positioned to be multiplied by the appropriate coefficient by addressing its corresponding LUT during the next cycle. 16-Tap FIR Filter using LUT-Based Serial Distributed Arithmetic. A+B S-RAM LUT S-RAM REG 16-Tap FIR Filter Using Serial Distributed Arithmetic + + REG S-RAM S-RAM SUM REG A B A + B + C_IN C_OUT S-RAM LUT S-RAM REG S-RAM + + 1-Bit Scaling Accumulator MS LS REG REG C_IN P REG CLK S-RAM + LUT + SUM 1/2 S-RAM Figure 18. SDA uses the smallest number of CLBs while processing all data samples (Taps) in parallel. as shown in Figure 19. For the 16Tap FIR filter example. This is done with serial adders.0 © 1995 XILINX.A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance By Gregory Ray Goslin the data flow can be changed to adapt to the input characteristics of any function. Figure 16. the shift registers could be constructed using registers or more efficiently implemented in a Synchronous RAM-based shift register. The Synchronous RAM-based shift register offers 16 times the density of a registerbased shift register (each CLB can be configured as two Registers. 8 . D0 S-REG The outputs from each LUT and ADDER can be registered in the same CLB that holds the LUT or ADDER at no extra cost. 4. All rights reserved. This also reduces the number of adder stages in the design. The oldest sample’s (the last Tap) data-bit is lost at the end of each process. an extra clock cycle is required to flush the Carry_Out from the addition of the two most significant bits of the input data [see Figure 18]. In a symmetrical filter the number of LUTs (or MACs) can be reduced by a factor of two.g. as shown in Figure 17.7 nsec. If the filter is symmetrical it is possible to further reduce the resources required for the design. A filter design implemented in an FPGA with SDA gives a significant amount of performance in a modest number of CLBs (e. The processing clock rate can be calculated by dividing the Sample Rate by the number of sample bits plus one.1. per Tap). Inc. S-RAM S-RAM + + LUT REG S-RAM S-RAM S-RAM REG S-RAM When serial adders are used to reduce the number of MACs in a Serial Distributed Arithmetic algorithm. V. the pipelining registers result in a continuous output data stream with a latency of 4-sample cycles. REG Serial Adder input/output diagram. DATA CLK INPUT_10 INPUT_11 INPUT_12 INPUT_13 INPUT_14 *** OUTPUT_06 OUTPUT_07 OUTPUT_08 OUTPUT_09 OUTPUT_10 *** *** *** Figure 17. two 16x1-bit Synchronous RAMs or one 32x1bit Synchronous RAM for data storage). This is done by adding the input data with a common coefficient prior to the multiplication. With the exception of the first parallel-in and serial-out shift register.

1] Ai SERIAL ADD A2 REG B SERIAL ADD A3 S-REG Bi LOOKUP TABLE PS_BITS_1 C Figure 19. With PDA. When the design is an n-Bit PDA. 2-Bit PDA uses a less modest number of CLBs then that of SDA. The performance for the 16-Tap FIR filter example.0 © 1995 XILINX. S-REG Ci + + LS 2-Bit Scaling Accumulator D S-REG Di R E G + + MS REG P LS 1/4 SUM A BITS[(n-2). such that one stores the even-bits and the other stores the odd-bits. in the case of SDA. The number of bits being processed during each clock cycle can be increased until an n-Bit PDA implementation is reached. There is also V. LUT-Based 2-Bit Parallel Distributed Arithmetic for a four-product MAC. compared with SDA). Inc. per Tap) compared to the same function with SDA. The performance for Serial-Sequential Distributed Arithmetic is rated at the maximum data rate for a given number of data-bits with SDA.3. The process is performed bit-serially. This technique uses an internal synchronous Dual-Ported RAM or FIFO to time multiplex each of the data samples..9 nsec.. for n-bit data samples.. through the logic. A filter design implemented in an FPGA with 2-Bit PDA results in twice the performance and about twice the number of CLBs (e. The 2-bit parallel data samples require twice the number of LUTs. 9 . referenced in the discussion on SDA. This technique is typically useful for data sample rates below 1 MHz.0] S-REG Ai B S-REG 1/2 LOOKUP TABLE Bi C S-REG Ci PS_BITS_0 D S-REG Di Figure 20. resulted with a sample rate of 16 MHz. Increasing the number of bits processed from 1-bit.A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance By Gregory Ray Goslin D0 S-REG S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM T0 T15 T1 T14 T2 T13 T3 T12 T4 T11 T5 T10 T6 T9 T7 T8 SERIAL ADD A0 SERIAL ADD A1 LUT SERIAL ADD A2 REG Symmetrical 16-Tap FIR Filter Using Serial Distributed Arithmetic + + 1-Bit Scaling Accumulator MS LS REG REG SERIAL ADD A3 SERIAL ADD A0 + P the addition of a 1-bit scaling adder... the serial shift registers.1.. 2-Bit PDA results in twice the throughput. are each replaced with two similar 1-bit shift registers at half the bit depth. SUM + SERIAL ADD A1 LUT 1/2 A S-REG BITS[(n-1). to a 2-bit PDA results in half the number of processing clock cycles [see Figure 20]. the number of bits being processed during each clock cycle is increased. the number of parallel bits sampled should be increased only to meet the required performance.2. required to add the two partial sums which results from each of the two parallel sample bits. bits serially. Note that increasing the number of bits sampled has a significant effect on the number of CLBs used for the design. All rights reserved. With 2-bit PDA. Symmetrical 16-Tap FIR Filter using LUT-Based Serial Distributed Arithmetic. The scaling accumulator’s input bus is expanded to accommodate the larger partial sum and the final scaling accumulator is changed from a 1. word-sequentially (rather than bit-serially. divided by the number of Taps.2-CLBs at 3. word-parallel. These changes essentially double the resources required compared to that of the SDA design. implemented with a 2-Bit PDA algorithm. at twice the SDA data sample rate. A “Serial-Sequential Distributed Arithmetic” (SSDA) technique could be used at lower data rates to conserve more CLBs. Hence. in 130 CLBs [see Figure 21].g.5.to a 2bit shift (1/4) for scaling. The two parallel shift registers are spilt.. the sample data rate is at a maximum.. while still processing all data samples (Taps) in parallel. For an XC4000E-3 device this will enable a single chip solution with a sustained data sample rate between 50 and 70 MHz. Parallel Distributed Arithmetic: “Parallel Distributed Arithmetic” (PDA) is used to increase the overall performance of Serial Distributed Arithmetic.4. Therefore. 8. This paper does not include a formal discussion on Serial-Sequential Distributed Arithmetic.

8-Bit FIR Filter Using Parallel Distributed Arithmetic With PDA. V.0 D 4.0 D 0.1 REG T0 T15 T1 T14 T2 T13 T3 T12 T4 T11 T5 T10 T6 T9 T7 T8 SERIAL ADD A0 + LSB + LSB 1/4 REG + LUT SUM D 6.0 D 3.2 REG REG REG BITS_2 2-Bit Scaling Accumulator MS P REG D 1.3 D 3.A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance By Gregory Ray Goslin D 0 Bits[7.2 REG REG + LUT SERIAL ADD A3 + D 0 Bits[6. while still processing all data samples (Taps) in parallel.1 REG REG REG Sign Extended MSB + REG SUM SERIAL ADD A1 LUT 1/2 REG REG + REG SERIAL ADD A2 REG + LUT + Sign Extended MSB 1/16 LSB PS_07 REG REG REG SERIAL ADD A3 + + REG BITS_1 + Sign Extended MSB + REG PS_03 D 1.0 D 1. as shown in Figure 22. The LUTs for SDA and PDA can always be the same for any given 4-MAC block.3 REG PS_47 LUT REG REG REG REG + REG SERIAL ADD REG + LUT REG A2 SERIAL ADD A3 + + REG PS_BITS_1 REG BITS_3 + Sign Extended MSB LSB D 1.3 D 4.2 D 0.1] S-REG S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM T0 T15 T1 T14 T2 T13 T3 T12 T4 T11 T5 T10 T6 T9 T7 T8 SERIAL ADD A0 SERIAL ADD A1 LUT Symmetrical 16-Tap FIR Filter Using 2-Bit Parallel Distributed Arithmetic REG D 7. 10 .0 © 1995 XILINX.g.2 D 7. A non symmetrical 16-Tap FIR filter.2 D 6.3 D 5.1 D 0. per Tap) compared to the same function with SDA.3.3 D 2.1 D 5. The performance for the symmetrical 16-Tap FIR filter example. resulted in a sample rate of 55 MHz.1 D 3.0 REG REG REG BITS_0 Figure 21. Filter REG First 4-Bits of 8-Taps in a 16-Tap.3 D 0. resulted in a sample rate of 58 MHz.0 REG 1/2 + REG SERIAL ADD A2 REG REG + LUT SERIAL ADD A3 D 2. 16-Tap FIR Filter using LUT-Based Parallel Distributed Arithmetic. All rights reserved.1.2 REG + SERIAL ADD REG A1 D 5.4. implemented with an 8-Bit PDA algorithm. in 400 CLBs. Figure 22.1 D 2.1 nsec.5.3 REG REG PS_23 REG SERIAL ADD A0 D 7.1 D 4. regardless of the number of bits in the sample data. This is true for PDA only if common bit-weighted sample inputs are used to address the LUT. as shown in Figure 22.2.0 REG + REG PS_O1 SERIAL ADD A1 LUT REG PS_BITS_0 REG LUT LSB REG D 5. Symmetrical 16-Tap FIR using 2-Bit Parallel Distributed Arithmetic. Inc. at the data sample rate. Full PDA uses a much larger number of CLBs then that of SDA.0 D 6. in 270 CLBs.0] S-REG S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM S-RAM PS_01 + REG D 2. each additional parallel bit requires an additional level of scaling (by powers of 2) and summation for each partial product pair of bits [see Figure 22].2 D 3. 19-CLBs at 1.1 LSB REG 1/4 SERIAL ADD A0 D 7.3 D 6.2 LUT REG REG 1/2 + REG LUT SERIAL ADD A2 REG D 4. The filter design example implemented in an FPGA with 8-Bit PDA resulted in more than seven times the performance and about four and a half times the number of CLBs (e.

Latest Version. [2] Goslin. NOTES: [7] Viterbi. JAN95. J. 8-Bit FIR Filter Application Guide. [5] Strauss. “Using Xilinx FPGAs to Design Custom Digital Signal Processing Devices. “Principles of Digital Communication and Coding. J. commonly called a gate array. DSP—Digital Signal Processing FIR—Finite-Impulse Response. NRE—Non-Recurring Engineering. The set-up or mask charges required to build an ASIC. 1965.” Forward Concepts. “16-Tap. SDA—Serial Distributed Arithmetic.com/appsweb. R. A. “DSP Strategies for the 90’s: The MidDecade Outlook. B.” Xilinx.0 © 1995 XILINX.com or FAX: 408-879-4442. CPLD—Complex Programmable Logic Device. Second Edition. 21NOV94. References: [1] Goslin. Xilinx WebLINX DSP Site Xilinx has a home page on the World-Wide Web that includes a special section for DSP.” Xilinx Publications. FPGA—Field Programmable Gate Array. MAC—Multiply/Accumulator. [6] “XC6200 Data Sheet. An alternate approach to implementing arithmetic functions. information or for help on implementing an algorithm in programmable logic. Latest Version. 12JAN95.” McGraw-Hill. A type of digital filter. The WebLINX UURL for DSP is http://www. IIR—Infinite-Impulse Response. G. & Newgard. Very efficient and a good compromise between speed and density in some FPGA architectures. Inc.1. A type of digital filter. Inc. HDL—Hardware Description Language such as VHDL and Verilog.” Xilinx. All rights reserved. New York. Glossary of Terms: ASIC—Application-Specific Integrated Circuit. DA—Distributed Arithmetic. An alternate approach to implementing arithmetic functions.” Xilinx. G. 595-604. PDA—Parallel Distributed Arithmetic. Also called EPLD or Erasable Programmable Logic Device. A DSP processor provides good performance in DSP applications because it executes a multiply and an addition in a MAC unit in a single clock cycle. contact the Xilinx DSP Applications Group. R.htm#DSP Application notes and data sheets are also available via WebLINX. K.” Proceedings of the DSPX 1995 Technical Proceedings pp. An alternate approach to implementing arithmetic functions. 11 . PSC—Parallel to Serial Converter. Inc. 1994 [4] “XC4000E Data Sheet. & Omura.A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance By Gregory Ray Goslin Other Resources: DSP Applications Group For more assistance in FPGA-Based DSP related applications. Efficient and high performance in some FPGA architectures. W.xilinx. Inc. V. Contact them via E-mail at dsp@xilinx. [3] “The programmable Logic Data Book.