FPGA Implementation of FIR Filters Using Pipelined Bit-Serial Canonical Signed Digit Multipliers

Shousheng He and Mats Torkelson
Department of Applied Electronics, Lund University, S-22100 Lund, Sweden. Email: he@tde.lth.se; torkel@tde.lth.se

Abstract - A pipelinable bit-serial multiplier using Canonic Signed Digit (CSD) code to represent constant coefficients is introduced. A bit-serial module for $a(x \pm y)z^{-1}$ type computation is further developed. Optimization over the discrete power-of-two coefficient space [1] has been retargeted to this type of multiplier to generate coefficients with a minimized number of non-zero bits. This also makes it possible to confine the latency to the data wordlength without causing a large delay in partial product sum propagation. A single-chip FPGA implementation of a full 16-bit 31-tap Hilbert transformer is used as an example to demonstrate the application of the multiplier module, with special consideration of FPGA architectures. It is shown that the FPGA architecture is an ideal vehicle for bit-serial processing optimized in this way.

For years numerous efforts have been made to reduce the implementation complexity of signal processors, which is measured by the Area-Time (AT) product. The two main aspects of this complexity are computation and communication. For computation complexity, efforts have focused on minimizing the number of necessary operations, mainly multiplications, e.g. [1], and on efficient implementation of such operations, represented by signed digit algorithms such as the modified Booth coding [2] and CSD coding [3] algorithms. Although the CSD coding algorithm has been proved optimal with respect to reducing the number of non-zero digits [4], it has found only very limited application [5] due to the coding complexity and the varying operation delay. For communication complexity reduction, one attractive solution is the bit-serial architecture, which is distinguished by efficient intra- and inter-chip communication and small, tightly pipelined processing elements [6]. Various bit-serial multipliers have been built based on modified Booth algorithms [7]. Recent development of the Field Programmable Gate Array (FPGA) [8] has presented a user-programmable, regular, register-rich architecture with abundant local and global connection resources. This architecture is very attractive for bit-serial and bit-level systolic processing. In this paper we first present a CSD coding multiplier for fixed coefficients and then enhance it into an $a(x \pm y)z^{-1}$ type operation. The technique for optimizing FIR coefficients over power-of-two space [1] is retargeted to this type of multiplier, with constraints on the number of non-zero signed bits to reduce overall complexity. A single-chip implementation of a full 16-bit 31-tap Hilbert transformer is used as an example to demonstrate the application.

CSD coding, similar to Booth coding, is a signed digit notation in which each digit has three possible values, $\{\bar{1}, 0, 1\}$, where $\bar{1}$ represents $-1$. CSD code has the property that it is unique (canonic) and requires the minimal number of non-zero digits in its representation [4]. It takes a value-dependent number of iteration steps to convert a two's complement code (or any other non-signed-digit code) into CSD code. This has been the main obstacle to its application in multiplier designs. For fixed-coefficient multiplication, as mostly appears in digital filters, the conversion can be carried out in advance and the coefficients can be considered as a constant vector of digits from the triple $\{\bar{1}, 0, 1\}$. Assume, without loss of generality, that both the data and the coefficients before CSD coding are in n-bit two's complement representation, as fractional numbers in the range $-1 < x < 1$. A multiplication will generate a result with a precision of $2n - 1$ bits; of these, only the n most significant bits (MSBs) are taken to represent a rounded product in a pipelined multiplier.
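The conversion to CSD code and its use in a fixed-coefficient multiplication can be illustrated with a minimal Python sketch. It works on plain integers rather than n-bit fractional words, and the function names `to_csd`, `is_canonic` and `csd_multiply` are hypothetical helpers introduced only for this illustration; it models the arithmetic, not the bit-serial hardware.

```python
def to_csd(value: int) -> list[int]:
    """Return the CSD digits of an integer, LSB first; each digit is -1, 0 or 1."""
    digits = []
    while value != 0:
        if value & 1:
            # Choose +1 or -1 so that the remainder becomes divisible by 4,
            # which forces the next digit to be zero (no adjacent non-zeros).
            digit = 2 - (value & 3)  # value mod 4 == 1 -> +1, value mod 4 == 3 -> -1
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def is_canonic(digits: list[int]) -> bool:
    """True when no two neighbouring digits are both non-zero."""
    return all(a == 0 or b == 0 for a, b in zip(digits, digits[1:]))


def csd_multiply(x: int, digits: list[int]) -> int:
    """Multiply x by a CSD coded constant using only shifts, adds and subtracts."""
    acc = 0
    for shift, digit in enumerate(digits):
        if digit:  # zero digits cost nothing
            acc += digit * (x << shift)
    return acc
```

For example, `to_csd(7)` gives `[-1, 0, 0, 1]`, i.e. 7 = 8 - 1, so a multiplication by 7 needs two non-zero digits instead of the three ones of the binary code 111.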


[Figure 1: Bit-serial pipeline multiplier (diagram labels: data propagation; partial product sum shift)]

The initial bit-serial multiplier for multiplication by a CSD coded coefficient can be derived by minor modification of the primitive pipelined multiplier proposed by Jackson, Kaiser and McDonald [9], where the coefficient bits are applied in parallel, least significant bit (LSB) first (to the leftmost module), and the data in an LSB-first stream, Figure 1. The coefficient bits are replaced by CSD coded bits, as shown in Figure 2 for the coefficient bits "1", "0" and "$\bar{1}$" respectively. Carry set/reset and sign extension, obtained



0-7803-1886-2/94 $3.00 © 1994 IEEE

with an extra flip-flop, are provided to allow two's complement computation. On the k-th step, each accumulated partial product is truncated to $k - 1$ bits while the corresponding carries are saved for the next accumulation. Rounding is obtained by adding an offset 1 at the $(n-1)$-th LSB and truncating at the corresponding step.

[Figure 2: Bit-serial modules for CSD coded coefficient multiplier]

If the modules are cascaded through the so → si terminals except at the final stage, where the final product comes out of the ls terminal, the multiplier will have a desirable n bit-time, or wordlength, latency, but the performance will be degraded by the partial sum propagation delay if the CSD coded coefficients have relatively many non-zero bits. An alternative is to connect the modules through the ls → si terminals to allow an increased bit rate, and to insert registers in both the data and the synchronizing signal paths to align the partial product accumulation. This will also double the latency of the multiplier.

III OPERATION MODULES IN DESIGN FIR FILTER

A prominent property of FIR filters is that a linear phase response can be obtained by imposing symmetry or antisymmetry conditions on the coefficients:

$$h(i) = \pm h(N-1-i), \quad i = 0, 1, \ldots, N-1$$

This linear phase property also has the advantage in implementation that only half the number of multipliers is required, since the system function can be written as

$$H(z) = \sum_{i=0}^{N/2-1} h(i)\left[z^{-i} \pm z^{-(N-1-i)}\right]$$

$$H(z) = h\!\left(\tfrac{N-1}{2}\right) z^{-(N-1)/2} + \sum_{i=0}^{(N-3)/2} h(i)\left[z^{-i} \pm z^{-(N-1-i)}\right]$$

for N even and odd respectively. It is obvious, as can be seen in the above equations, that the most efficient implementation of an FIR filter will need a module to calculate $a(x \pm y)z^{-1}$. Direct employment of the bit-serial multiplier described above in FIR filter design will not only cause some hardware redundancy, but also introduce some extra latency in the processing. The CSD multiplier modules of the above section can be further extended to include the operations $1*(x+y)$, $\bar{1}*(x+y)$ and $1*(x-y)$, as Figure 3 shows.

[Figure 3: $a(x \pm y)z^{-1}$ operator]

Note that the signals applied to these modules have to be scaled down by half, before the multiplication takes place, to prevent overflow of the addition. A $1*(x-y)$ module is shown in Figure 4. The $\pm 1*(x+y)$ modules are similar except for minor differences in the carry set/reset circuitry. A $\bar{1}*(x-y)$ module is not necessary since $\bar{1}*(x-y) = 1*(y-x)$; the operation can be accomplished by just exchanging the x, y terminals of the $1*(x-y)$ module.

[Figure 4: Bit-serial module for $a(x - y)z^{-1}$ operation]

IV FPGA IMPLEMENTATION OF A HILBERT TRANSFORMER

Hilbert transformers are frequently used in communication systems and signal processing. An ideal Hilbert transformer is an all-pass filter that imparts a $\pi/2$ phase shift on the signal at its input. The ideal Hilbert transformer has an infinite impulse response and is non-causal. In practice, it is sufficient that its bandwidth covers the bandwidth of the desired signal to be phase shifted. Assuming a symmetric passband has been chosen, an odd length is preferable since the computational complexity, i.e. the number of multipliers, will be cut by half [10], because every other coefficient $h(i)$ is zero (the even-indexed and odd-indexed taps are given by separate cases).
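The saving obtained from coefficient antisymmetry and from the zero-valued taps of an odd-length Hilbert transformer can be sketched in Python as follows. This is an illustration under stated assumptions, not the filter of the paper: `hilbert_taps` uses the plain truncated ideal response $2/(\pi m)$ for odd offsets $m$ from the centre tap (the paper instead optimizes the coefficients), and all function names are hypothetical. `folded_fir` forms $a(x - y)$ per coefficient pair, which is the operation the $a(x - y)z^{-1}$ module provides.

```python
import math


def hilbert_taps(num_taps: int) -> list[float]:
    """Truncated ideal Hilbert transformer of odd length (assumed, unoptimized)."""
    assert num_taps % 2 == 1, "an odd length is assumed"
    centre = (num_taps - 1) // 2
    taps = []
    for i in range(num_taps):
        m = i - centre  # offset from the centre tap
        taps.append(0.0 if m % 2 == 0 else 2.0 / (math.pi * m))
    return taps


def direct_fir(x: list[float], h: list[float]) -> list[float]:
    """Reference convolution: one multiplication per tap."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, coeff in enumerate(h):
            if n - i >= 0:
                acc += coeff * x[n - i]
        y.append(acc)
    return y


def folded_fir(x: list[float], h: list[float]) -> tuple[list[float], int]:
    """Antisymmetric FIR using one multiplication a*(x - y) per non-zero tap pair.

    The centre tap is ignored because it is zero for an antisymmetric filter.
    Returns the output and the number of multipliers actually needed.
    """
    last = len(h) - 1
    pairs = [(i, h[i]) for i in range(len(h) // 2) if h[i] != 0.0]

    def past(n: int) -> float:
        return x[n] if n >= 0 else 0.0  # zero initial state

    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, a in pairs:
            acc += a * (past(n - i) - past(n - (last - i)))
        y.append(acc)
    return y, len(pairs)
```

With 31 taps, `folded_fir` reports 8 multipliers, which agrees with the 8 concurrent multipliers of the implementation described in this paper.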

IV.1 COEFFICIENTS OPTIMIZATION

There are algorithms using integer programming techniques to optimize the FIR filter coefficients (i.e. the impulse response) over power-of-two space. By converting the result to CSD code, this technique can be retargeted to the CSD multipliers introduced in the previous sections. Choosing the normalized cutoff frequencies $f_l = 0.05$ and $f_u = 0.5 - f_l$, Table 1 shows the optimization of the coefficients (only the non-zero ones are displayed) using 2 and 4 non-zero bits of a 14-bit CSD code respectively. The gain response is satisfactory with 2 non-zero bits and nearly equivalent to that of infinite precision with 4 non-zero bits if the normalized ripple is used as the performance criterion. Coefficients with 2 non-zero bits are used in the following example.

[Table 1: Non-zero coefficients and their CSD multipliers. Columns: h(n), float, 14-2bit CSD, 14-4bit CSD; the last rows give gain and ripple (dB). Floating-point values of the eight non-zero coefficients: 0.63134944, 0.19684343, 0.10303845, 0.05955382, 0.03440993, 0.01884848, 0.00928547, 0.00420406.]

IV.2 FPGA IMPLEMENTATION CONSIDERATION

In our approach the widely available medium grain FPGA, the XC3090 Logic Cell Array (LCA) from Xilinx [8], is used as the vehicle for implementation of CSD multiplier based FIR filters. Its Configurable Logic Block (CLB) array architecture is considered favorable for bit-serial processing because the flexible Look-Up-Table based functional cell can implement any function of up to 5 variables without extra routing penalty. The direct input to the embedded flip-flops enables two bit-time delays and a 4-variable function to share the same CLB. Furthermore, this feature also enables a trade-off between registers and combinational logic when one of them becomes scarce due to heavy employment.

Instead of propagating the synchronizing signal as in Figure 2, a globally synchronized local processing counter is assigned to each multiplier, as shown for the 8-bit multiplier in Figure 5. The correct sum propagation is instead accomplished by a more elaborate control of the sign extension register by the local counter. All the zero-valued bits are eliminated from the partial product accumulator; this greatly reduces the area consumption and routing congestion in the implementation. In this example, each non-zero multiplier bit module is implemented in 3 CLBs, including the 3 bit-time delays shared with the data passages. The glue logic shares the same CLB with the neighboring bit-time delay.

[Figure 5: An 8-bit implementation example of the $a(x + y)z^{-1}$ operator with 2 non-zero bits]

IV.3 PARALLEL/SERIAL CONVERSION AND SYNCHRONIZATION

To prototype the bit-serial processor on an SBus based desk-top prototyping system [11], which provides only a parallel interface to the signal processors being prototyped, a parallel/serial conversion has been implemented on the same chip. Its control logic also synchronizes the operation of the post-multiplication adders, which introduce extra latency. There are 8 adders to accumulate the multiplication output, which are arranged in $\log_2 8 = 3$ levels. Figure 6 shows the timing diagram for synchronizing the control of handshaking, parallel/serial conversion and processing operations. The shift between the global finite state machine, which controls the handshaking and parallel/serial conversion, and the processing/multiplication counter pipelines the operations and absorbs the latency of the post-multiplication adders.

[Figure 6: Timing diagram for processing control and parallel-serial conversion]

IV.4 COMPLEXITY AND PERFORMANCE

Thus, with minimum control, a complete implementation of the adder-multiplier-delay module, as shown in Figure 3, requires $n + 1.5k + c$ CLBs, where n is the data wordlength, k the number of non-zero bits in the CSD multiplier, and c the number of CLBs required for the local counter. For example, with n = 16 and k = 2, and a local 16-state counter that can be implemented with 2 CLBs, these put together a total of 16+3+2=21 CLBs for one such multi-functional module.
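The CLB count quoted for one adder-multiplier-delay module (the data wordlength, plus 1.5 CLBs per non-zero coefficient bit, plus the CLBs of the local counter) can be checked with a short Python sketch. The function name `module_clbs` and its parameter names are hypothetical and serve only this arithmetic check.

```python
def module_clbs(wordlength: int, nonzero_bits: int, counter_clbs: int) -> float:
    """Estimated CLBs for one adder-multiplier-delay module.

    wordlength   : data wordlength n (one CLB per bit-time of delay path)
    nonzero_bits : non-zero digits in the CSD coefficient (1.5 CLBs each)
    counter_clbs : CLBs used by the local processing counter
    """
    return wordlength + 1.5 * nonzero_bits + counter_clbs


# Example from the text: 16-bit data, 2 non-zero bits, 2-CLB counter.
example_total = module_clbs(wordlength=16, nonzero_bits=2, counter_clbs=2)
```

Here `example_total` evaluates to 21.0, matching the 16+3+2=21 CLBs stated for the example module.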

The maximum logic level is 3.

The FPGA implementation of the Hilbert transformer, including all the 8 multipliers, 16 adders, 32 unit-delays (each consisting of 16 bit-time delays, or flip-flops), parallel/serial conversion and control logic, takes 312 of the 320 CLBs available on a single XC3090 chip. The implementation has been fully automatically placed and routed with the standard XACT [12] software without the usual difficulties for such high utilization, due to the reduction of the communication complexity in the bit-serial processing and the even spreading of the circuitry over the entire chip area. Operation at clock rates up to 20 MHz has been tested using a 70-grade chip, which accomplishes a 20*8/16 = 10 M/sec multiplication rate, plus 16 additions, using 8 concurrent multipliers, and outperforms some dedicated commercial DSP chips. Figure 7 shows the amplitude of the frequency response of the Hilbert transformer under test.

[Figure 7: Frequency response of FIR Hilbert transformer]

V CONCLUSION AND DISCUSSION

A bit-serial pipelined CSD multiplier has been presented. Direct connection of the modules by cascading keeps the maximal logic level under $k + 1$, where k is the number of non-zero bits. This saves logic and simplifies the overall design since in most cases no timing alignment as in [6] is required. Optimization over power-of-two space can be retargeted to this multiplier. If the application problem has good discrete optimizability, which can be measured [1], 4 non-zero bits in the multiplier can produce a nearly equivalent performance. By scaling the coefficients one bit down, which is allowable in most cases, and inserting a flip-flop in the middle of the 4 stages, the multiplication can be finished in wordlength latency with a maximum logic level equivalent to 2 stages. This is a very desirable property of the bit-serial multiplier since it enables multi-functional modules, such as the multiply-delay modules $az^{-1}$ and $a(x \pm y)z^{-1}$, and pipelined operation in feedback loops, which occur in multipliers such as those in IIR filters.

The multiplier has a comparatively low AT product, in terms of CLBs, since each non-zero multiplication bit in the shared functional module costs only 1.5 extra CLBs and the pipelined bit-serial processing enables a higher data rate. This has been shown in a single-chip implementation of a full 16-bit 31-tap Hilbert transformer. The register-rich medium grain FPGA, such as the XC3090 from Xilinx, has been found to be a quite suitable vehicle for implementing bit-serial processing. The reduction of communication complexity and the simplification of global control ease the implementation/performance bottleneck due to placement and routing. It is expected that more complex tasks will be implemented on FPGAs in the form of bit-serial or bit-level systolic processing.

VI ACKNOWLEDGMENT

The authors wish to thank Dr. Y. C. Lim for offering the program MILP for FIR coefficient optimization over power-of-two space.

REFERENCES

[1] Y. C. Lim and Sydney R. Parker. FIR filter design over a discrete power-of-two coefficient space. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-31(3):583-591, June 1983.
[2] Homayoon Sam and Arupratan Gupta. A generalized multibit recoding of two's complement binary numbers and its proof with application in multiplier implementations. IEEE Trans. Comput., C-39(8):1006-1015, August 1990.
[3] Abraham Peled. On the hardware implementation of digital signal processors. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-24(1):76-86, February 1976.
[4] George W. Reitwiesner. Binary arithmetic. Advances in Computers, 1:231-308, 1960.
[5] S. P. Pope and R. W. Brodersen. Macrocell design for concurrent signal processing. In Bryant, editor, Proceedings of the Third Caltech Conference on Very Large Scale Integration, pages 395-412, 1983.
[6] Peter Denyer and David Renshaw. VLSI Signal Processing: A Bit-Serial Approach. Addison-Wesley, 1985.
[7] R. F. Lyon. Two's complement pipeline multiplier. IEEE Trans. Commun., COM-24(4):418-425, 1976.
[8] Xilinx, Inc. The Programmable Gate Array Data Book. 1989.
[9] L. B. Jackson, J. F. Kaiser, and H. S. McDonald. An approach to the implementation of digital filters. IEEE Trans. Audio and Electroacoust., AU-16(3):413-421, 1968.
[10] John G. Proakis and Dimitris G. Manolakis. Introduction to Signal Processing. Macmillan Publishing Company, 1989.
[11] Shousheng He and Mats Torkelson. FPGA application in a SBus based rapid prototyping system for ASDSP. In Proceedings of 2nd International Workshop on Field-Programmable Logic and Applications, Vienna, Austria, August 1992.
[12] Xilinx, Inc. XACT, the programmable gate array development system. 1989.