
International Conference on Computational Intelligence and Multimedia Applications 2007

A High Speed Block Convolution using Ancient Indian Vedic Mathematics


Hanumantharaju M.C, Jayalaxmi H, Renuka R.K, Ravishankar M

Dept of E & C Acharya Institute of Technology Bangalore-560090.


(hanu2005@yahoo.com)

Dept of E & C Acharya Institute of Technology Bangalore-560090


(sohan_jm@yahoo.co.in)

Dept of E & C Acharya Institute of Technology Bangalore-560090.


(renu_kajur@yahoo.com)

Dept of ISE Dayanand Sagar College of Engg. Bangalore-560078.


(bmsearch@gmail.com)

Abstract
In Digital Signal Processing applications, convolution with a very long sequence is often required. To compute the convolution of a long sequence, the Overlap-Add (OLA) and Overlap-Save (OLS) methods can be considered. OLA and OLS are well-known, efficient schemes for high-order filtering. The most commonly used implementation platforms for digital filtering algorithms are Digital Signal Processors, special-purpose digital filtering chips and, for large volumes, Application Specific Integrated Circuits (ASICs). In this paper, a high-performance, high-throughput and area-efficient architecture for the Field Programmable Gate Array (FPGA) implementation of the block convolution process is proposed. The most significant aspect of the proposed method is the development of a multiplier architecture based on the vertical and crosswise structure of Ancient Indian Vedic Mathematics and its embedding in the OLA and OLS methods for improved efficiency. The coding is done in VHDL (Very High Speed Integrated Circuit Hardware Description Language) and the FPGA synthesis is done using the Xilinx Spartan library. The results show that the OLA and OLS methods of block convolution implemented using Vedic multiplication are efficient in terms of area and speed compared with their implementation using conventional multiplier architectures.

1. Introduction
With the latest advances in VLSI technology, the demand for portable and embedded Digital Signal Processing (DSP) systems has increased considerably. Using programmable devices for DSP applications could narrow the gap between the flexibility of the General Purpose Processor (GPP) and the Programmable DSP (PDSP). FPGAs are increasingly used for a variety of computationally intensive applications. In digital signal processing, convolution is a fundamental computation that is ubiquitous in many application areas. Computing linear convolution by a sequence of circular or periodic convolutions over suitable finite blocks of the input data is a well-known method for efficiently computing convolution, called fast convolution [7]. In filtering a speech waveform, for example, the input signal is theoretically of indefinite duration, and processing it as a single block is often cumbersome. As is well known, if the output of a filter is computed for a block of samples at once, the number of operations can be reduced. The modified overlap-add algorithm [1] executes faster than the traditional algorithm but incurs an additional delay in samples; these algorithms were implemented using Matlab. The extended overlap-add and extended overlap-save methods [3], which utilize the FFT, are more efficient than the polyphase structure, but the implementation leads to high efficiency when a high-order FIR filter is employed. The uniformly partitioned fast block convolution algorithm [5] with arbitrary delay performs the fast block convolution with the same cost as the conventional method at 2^n delays, and also handles arbitrary lengths other than 2^n with moderate additional computations

0-7695-3050-8/07 $25.00 2007 IEEE DOI 10.1109/ICCIMA.2007.332


but creates some redundancies in the output data because of the additional overlapping. In this paper, a novel multiplier architecture [6] based on the Vertical and Crosswise structure of Ancient Indian Vedic Mathematics is embedded into OLA to improve its efficiency in terms of speed and area. A Virtex FPGA is used to implement the Overlap-Add and Overlap-Save methods with block length L and filter duration M (L >> M). The paper is organized as follows: Section 2 presents the multiplier architecture based on the Vertical and Crosswise structure of Ancient Indian Vedic Mathematics. Section 3 describes the Overlap-Add convolution method. Section 4 describes the FPGA implementation of the block convolution process. Section 5 describes the experimental results. Finally, the paper is concluded in Section 6.

2. Multiplier Architecture
The multiplier architecture is based on the Vertical and Crosswise algorithm. The architecture is illustrated with two 8-bit numbers: the multiplier and the multiplicand are each split into 4-bit groups so that the operation decomposes into 4×4 multiplication modules. After decomposition, the vertical and crosswise algorithm [9] is applied to carry out the multiplication in the first 4×4 multiply module. The result of the first 4×4 multiplication module is combined with the sub-product bits obtained in parallel from the subsequent modules to generate the final 16-bit product. Hence any complex N×N multiplication can be efficiently implemented from small 4×4 multipliers using the proposed architecture, where N is a multiple of 4 such as 8, 12, 16, 20, 24 and so on. An efficient multiplication algorithm for small operands, such as 4 bits, can therefore be easily extended and embedded to implement an efficient N×N multiply operation.

The algorithm of the proposed architecture for an 8×8-bit multiplication (A×B) is as follows. The 8-bit multiplicand A is decomposed into a pair of 4-bit groups $A_H$-$A_L$; similarly, the multiplier B is decomposed into $B_H$-$B_L$. The 16-bit product can be written as

$$P = A \times B = (A_H\text{-}A_L) \times (B_H\text{-}B_L) = A_H \times B_H + A_H \times B_L + A_L \times B_H + A_L \times B_L \qquad (1)$$

Here $A_H$-$A_L$ denotes the concatenation of the two 4-bit groups, so each term carries the positional weight of its operands ($2^8$ for $A_H \times B_H$, $2^4$ for the two cross terms and $2^0$ for $A_L \times B_L$).

Multiplicand [8 bit]: A = A7 A6 A5 A4 | A3 A2 A1 A0 = X1 | X0
Multiplier [8 bit]: B = B7 B6 B5 B4 | B3 B2 B1 B0 = Y1 | Y0

where X0, Y0, X1 and Y1 are each 4-bit numbers.

Parallel Computation & Methodology

1. CP of X0 and Y0: $X_0 Y_0 = P_0$ (2)
2. CP of X1 X0 and Y1 Y0: $X_1 Y_0 + X_0 Y_1 = P_1$ (3)
3. CP of X1 and Y1: $X_1 Y_1 = P_2$ (4)

CP = cross product; P0, P1, P2 = partial products.
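The decomposition above can be checked in software. The following is a minimal Python sketch, not the authors' VHDL design: the function names `vedic_4x4` and `vedic_8x8` are hypothetical, the 4×4 module is modelled as bit-level column sums (one plausible reading of the vertical and crosswise scheme), and the partial products P0, P1 and P2 are recombined with the positional weights $2^0$, $2^4$ and $2^8$, which is an assumption made explicit here because equation (1) leaves the weights implicit.

```python
def vedic_4x4(x, y):
    """Multiply two 4-bit numbers using vertical-and-crosswise column sums.

    Illustrative model only: column k adds every bit product x_i * y_j
    with i + j == k, and the carry ripples into the next column.
    """
    xb = [(x >> i) & 1 for i in range(4)]
    yb = [(y >> i) & 1 for i in range(4)]
    result, carry = 0, 0
    for k in range(7):  # columns 0..6 of a 4x4 product
        col = carry + sum(xb[i] & yb[k - i] for i in range(4) if 0 <= k - i < 4)
        result |= (col & 1) << k   # low bit of the column sum is product bit k
        carry = col >> 1           # the rest is carried forward
    return result | (carry << 7)   # final carry is product bit 7


def vedic_8x8(a, b):
    """Multiply two 8-bit numbers from 4x4 modules, following equations (2)-(4)."""
    x1, x0 = a >> 4, a & 0xF       # A = X1 | X0
    y1, y0 = b >> 4, b & 0xF       # B = Y1 | Y0
    p0 = vedic_4x4(x0, y0)                      # equation (2)
    p1 = vedic_4x4(x1, y0) + vedic_4x4(x0, y1)  # equation (3)
    p2 = vedic_4x4(x1, y1)                      # equation (4)
    # Assumed recombination: weights 2^0, 2^4 and 2^8 for P0, P1 and P2.
    return (p2 << 8) + (p1 << 4) + p0


if __name__ == "__main__":
    print(vedic_8x8(0xAB, 0xCD) == 0xAB * 0xCD)
```

In hardware the four 4×4 modules would operate in parallel; the sequential calls here only mirror the arithmetic, not the timing.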


3. Overlap-Add Convolution Method


The overlap-add method [7, 8] is formulated in this section. The filtering of the input sequence x[n] with h[n] using OLA can be implemented as follows.

Step 1: The input sequence x[n] is segmented into blocks of length L, and at least M − 1 zeros are appended after each segment so that each input data block has at least L + M − 1 points. The impulse response h[n] and the input signal x[n] are shown in Figure 1. The segmentation is expressed as

$$x[n] = \sum_{r=0}^{\infty} x_r[n - rL] \qquad (5)$$

where

$$x_r[n] = \begin{cases} x[n + rL], & 0 \le n \le L - 1 \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$

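$x_r[n]$ is the $r$th section of the sequence x[n]. The length of each circular convolution, denoted here by N, has to satisfy N ≥ L + P − 1, where P is the length of h[n] (the filter duration M), so that each circular convolution equals the corresponding linear convolution. The segmented sections are shown in Figure 2.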

Figure 1. Finite-length impulse response h[n] and indefinite-length signal x[n].

Figure 2. Decomposition of x[n] into non-overlapping sections of length L.

Step 2: Perform the circular convolution of the zero-padded input data block $x_r[n]$ and the impulse response h[n], using a circular length of at least L + M − 1 points.

Step 3: Overlap and add the output data blocks $y_r[n]$ to obtain the overall filtered sequence

$$y[n] = x[n] * h[n] = \sum_{r=0}^{\infty} y_r[n - rL] \qquad (7)$$

where $y_r[n] = x_r[n] * h[n]$, which is shown in Figure 3.
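Steps 1 to 3 can be sketched in a few lines of Python. This is an illustrative software model under stated assumptions, not the FPGA implementation: the helper names `circular_convolve`, `overlap_add` and `direct_convolve` are hypothetical, the circular length is taken as N = L + M − 1, and the circular convolution is evaluated directly rather than with an FFT or the Vedic multiplier.

```python
def circular_convolve(a, b, n):
    """n-point circular convolution of a and b, each zero-padded to n points."""
    a = list(a) + [0] * (n - len(a))
    b = list(b) + [0] * (n - len(b))
    return [sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)]


def overlap_add(x, h, L):
    """Filter x with h by the overlap-add method using blocks of length L."""
    M = len(h)
    N = L + M - 1                       # assumed circular length (no aliasing)
    y = [0] * (len(x) + M - 1)
    for start in range(0, len(x), L):   # Step 1: segment x into blocks x_r
        block = x[start:start + L]
        y_r = circular_convolve(block, h, N)   # Step 2: y_r = x_r * h
        for i, value in enumerate(y_r):        # Step 3: overlap and add
            if start + i < len(y):
                y[start + i] += value
    return y


def direct_convolve(x, h):
    """Reference linear convolution used only to check the block result."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


if __name__ == "__main__":
    x = list(range(1, 24))
    h = [1, -2, 3]
    print(overlap_add(x, h, L=5) == direct_convolve(x, h))
```

Each output block has N points, so consecutive blocks overlap by M − 1 samples; adding the overlapping tails is what reproduces the linear convolution of equation (7).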


Figure 3. Convolution of each section with h[n] (panels: y0[n], y1[n] and y2[n], each plotted against n).

4. Implementation on FPGA
In this paper, the block convolution algorithm is implemented in VHDL (Very High Speed Integrated Circuit Hardware Description Language), and logic simulation is done in the Modelsim XE III 6.0d simulator. Synthesis and FPGA implementation are done using Xilinx ISE 9.1i. The design is optimized for speed and area for the Xilinx device family Virtex XCV300e, package BG432, speed grade -8. The Xilinx Virtex XCV300-8 device is used, and the device contains 154 configurable logic blocks, 77 slices, 136 four-input look-up tables and 33 bonded input/output pads.

5. Results and Discussions


The proposed Vedic multiplier architecture achieves a significant improvement in performance over traditional multiplier architectures. Figure 4 shows that, as the number of bits in the multiplier increases, the proposed parallel multiplier architecture increasingly outperforms the traditional multiplier architectures. If the operand width is increased further to N×N bits (where N can be any number), the proposed parallel architecture has the greatest advantage over the other multiplier architectures in terms of gate delay and regularity of structure.
Figure 4. Comparison of multipliers with respect to timing delay in Virtex FPGA. (Vertical axis: time delay; horizontal axis: 8×8 bit; series: Traditional Array Multiplier, Traditional Booth Multiplier, Overlay Array Multiplier, Vedic Multiplier, Overlay Booth Multiplier.)

Figure 5. Comparison of block convolution with respect to timing delay. (Vertical axis: time for different methods; horizontal axis: input sequence length; series: Traditional overlap (s), Modified overlap (s), FPGA (ns).)


Table 1. Execution times for traditional, modified and FPGA implementations of OLA

Input sequence length   Traditional overlap (s)   Modified overlap (s)   FPGA (ns)
2^17                    0.21                      0.12                   36.85
2^18                    0.79                      0.42                   36.89
2^19                    2.93                      1.55                   36.91
2^20                    11.28                     5.77                   36.92

The execution time of the proposed method is compared with that of different implementation approaches. It has been found that embedding the Vedic multiplier architecture in OLA gives a considerable improvement in performance compared with the traditional methods of implementation. The results are grouped in Table 1 for different sequence lengths.
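To put the figures of Table 1 on a common scale, the ratio between the software and FPGA execution times can be computed directly. The short Python sketch below only rearranges the tabulated values; the names `TABLE_1` and `speedup` are hypothetical, and the input lengths are read as powers of two (2^17 to 2^20), which is an assumption about the extracted table.

```python
# Execution times copied from Table 1.
# Key: assumed exponent of the input length (2**key samples).
# Value: (traditional overlap in s, modified overlap in s, FPGA in ns).
TABLE_1 = {
    17: (0.21, 0.12, 36.85),
    18: (0.79, 0.42, 36.89),
    19: (2.93, 1.55, 36.91),
    20: (11.28, 5.77, 36.92),
}


def speedup(software_seconds, fpga_nanoseconds):
    """Ratio of a software time in seconds to an FPGA time in nanoseconds."""
    return software_seconds / (fpga_nanoseconds * 1e-9)


if __name__ == "__main__":
    for exponent, (traditional, modified, fpga) in TABLE_1.items():
        print(
            f"2^{exponent}: traditional/FPGA = {speedup(traditional, fpga):.3g}, "
            f"modified/FPGA = {speedup(modified, fpga):.3g}"
        )
```

The ratio compares the times as reported in the table and says nothing about differences between the software and hardware platforms.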

6. Conclusion
In the proposed method, the FPGA implementation of the OLA algorithm is presented. To exploit the advantage of the OLA algorithm over other implementation approaches, the Vedic multiplier has been embedded. The proposed multiplier architecture has the advantage that, as the number of bits increases, its gate delay and area increase very slowly compared with other multiplier architectures. It has been demonstrated that further hierarchical decomposition of the 4×4 modules into 2×2 modules does not significantly improve the multiplier efficiency; in other words, multiplier decomposition nearly reaches a saturation level in its efficiency at the 4×4 decomposition. The execution time for block convolution using the FPGA reduces from 0.12 s to 36.85 ns. The designs are found to be quite efficient in terms of silicon area and speed and should result in substantial savings of hardware resources when used for signal and image processing applications.

References
[1] Madihalli J. Narasimha, Modified Overlap-Add and Overlap-Save Convolution Algorithms for Real Signals, IEEE Transactions on Signal Process, vol. 13, no. 11, pp. 669-671, Nov. 2006.
[2] Madihalli J. Narasimha, Linear Convolution using Skew Cyclic Convolutions, IEEE Transactions on Signal Process, vol. 14, no. 3, pp. 173-176, Mar. 2007.
[3] Shogo Muramatsu, Hitoshi Kiya, An Extended Overlap-Add and Save Methods for Multirate Signal Processing, IEEE Transactions on Signal Process, vol. 45, no. 9, pp. 2376-2380, Sep. 1997.
[4] John W. Pierre, A Novel Method for Calculating Convolution Sum of Two Finite Length Sequences, IEEE Transactions on Education, vol. 39, no. 1, pp. 77-80, Feb. 1996.
[5] Jung Kap Kuk and Nam Ik Cho, Block Convolution with Arbitrary Delays using Fast Fourier Transform, in Proceedings of IEEE International Symposium on Intelligent Signal Processing and Communication Systems, Dec. 2005.
[6] Hanumantharaju M.C and Shashidhara K.S, A Novel Multiplier Architecture for FIR Filter Based on Field Programmable Gate Arrays, IEEE International Conference on Signal and Image Processing, Hubli, Dec. 2006.
[7] A. V. Oppenheim and R. Schafer, Discrete-Time Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[8] J. G. Proakis and D. G. Manolakis, Digital Signal Processing. Macmillan, 1988.
[9] A. P. Nicholas, K. R. Williams, J. Pickles, Vertically and Crosswise: Applications of the Vedic Mathematics Sutra, Motilal Banarsidass Publishers, Delhi, 2003.
[10] Uwe Meyer-Base, Digital Signal Processing with Field Programmable Gate Array, Springer Inc, Heidelberg, 2004.

