
European Journal of Scientific Research ISSN 1450-216X Vol.33 No.1 (2009), pp.6-29 EuroJournals Publishing, Inc. 2009 http://www.eurojournals.com/ejsr.htm

An FPGA Based Generic Framework for High Speed Sum of Absolute Difference Implementation
Saad Rehman
Department of Engineering and Design, University of Sussex, Brighton, UK
E-mail: S.Rehman@sussex.ac.uk

Rupert Young
Department of Engineering and Design, University of Sussex, Brighton, UK

Chris Chatwin
Department of Engineering and Design, University of Sussex, Brighton, UK

Phil Birch
Department of Engineering and Design, University of Sussex, Brighton, UK

Abstract
In this paper we present a hardware architecture for the Sum of Absolute Difference (SAD) technique. The paper gives the design details and the implementation results for an FPGA based core that permits the realisation of a high speed matching algorithm for real time image processing applications. The matching criterion chosen is the SAD algorithm. The implementation provides the correct position of the target within the frame/image. The ease of implementation lies in the fact that the core is highly parameterized and can therefore cater effectively to different sizes and resolutions of images and filters. The high speed and the low silicon area usage make it useful for a number of image processing applications. The paper also gives a review of different hardware architectures.

Keywords: Sum of Absolute Difference, Hardware Architecture, FPGA, Parameterized, Core, Algorithm

1. Introduction
Sum of Absolute Difference (SAD) is mainly used for motion estimation [1-3]. If there is a moving object, the compression ratio between two consecutive frames declines; the object must therefore be detected correctly to keep the compression ratio at an optimal level. SAD simply subtracts the pixel values of consecutive frames; if the result is zero, the object is detected. SAD is used as a similarity measure for block matching in several image-processing applications. SAD takes the absolute value of the difference between each pixel in the kernel block and the corresponding pixel in the target image block. The differences are accumulated to create a metric of block similarity. In order to find a kernel image (generally smaller in size) within a bigger image, known as the target image, the kernel is placed on the image and the SAD is calculated; it is then moved to the next location on the image and the SAD is calculated again. This is repeated until the SAD has been calculated over the entire image. The greater the value of the SAD operation, the lower will be the
similarity between the kernel and the specific part of the image over which the SAD is computed. A perfect match will result in a SAD of 0. Several other similarity metrics are also used for specific applications. Similar to the Sum of Absolute Differences (SAD), the Mean Absolute Difference (MAD) [4] is also used. Another similarity measure is the Weighted Euclidean Distance (WED) [5], generally used to compare two iris patterns. Normalized cross-correlation [6] between the acquired image and database images is also used to calculate the goodness of match, for example in the case of an Automated Fingerprint Identification System (AFIS) [7].
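The moving-window search described above can be sketched in software as follows (an illustrative model only, not the hardware design; the function names sad and sad_search are ours):

```python
def sad(kernel, block):
    """Sum of absolute differences between two equal-size pixel blocks."""
    return sum(abs(k - b) for krow, brow in zip(kernel, block)
                          for k, b in zip(krow, brow))

def sad_search(image, kernel):
    """Slide the kernel over the image and return (best_sad, row, col)."""
    kh, kw = len(kernel), len(kernel[0])
    best = None
    for r in range(len(image) - kh + 1):
        for c in range(len(image[0]) - kw + 1):
            block = [row[c:c + kw] for row in image[r:r + kh]]
            s = sad(kernel, block)
            if best is None or s < best[0]:
                best = (s, r, c)
    return best
```

A perfect match returns a SAD of 0 at the matching row and column.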
Figure 1: SAD Calculations on an image with a 4x4 kernel as a moving window operation

[Diagram: the 4x4 kernel F11..F44 is overlaid on the image pixels I11..I48 at successive window positions, shifting one position at a time across the image.]
Because of its simplicity, the SAD is considered a very fast and effective metric for similarity calculation in images. It takes into account all the pixels present in a window without attaching any specific bias to particular values. Since the calculation of SAD for one block in an image is done without affecting the calculation for other blocks in its vicinity, the SAD can be calculated in parallel. Parallelizable architectures are easily synthesized on FPGAs, which are fast reconfigurable devices. The powerful FPGAs available today facilitate high-speed, parallelizable and scalable implementations of such architectures. The most attractive feature of these semiconductor devices is their reconfigurability for the user application. This not only gives design flexibility but also reduces the time to market. There are several vendors providing FPGAs with many different logic block capacities and other application-specific cores and features. The main advantage of reconfigurability is that the SAD micro engine design can be tailored to the needs of the application for which it is being used. It can be modified to accommodate different sizes of kernels, different sizes of images, different frame rates (up to a certain maximum speed) and different error tolerances allowed for a legal match of the kernel with the image.
The SAD calculation in hardware is carried out by SAD micro engines, each capable of carrying out the sum of absolute difference calculation for a block of 4x4 bytes (gray scale image, values ranging from 0 to 255). For the SAD calculation of bigger blocks, several copies of the SAD micro engine are used as an array of micro engines. The hardware designed has some important features which make it generic and robust in nature. These features are explained below:
The size of the image is not fixed; the user can provide any size, provided that it fits in the memory.
The size of the kernel is also generic; the minimum kernel size is 4 x 4.
In order to speed up the process, more kernels can be used in parallel.
A modified Dadda multiplier is used for the partial product reduction in the addition. This technique speeds up the whole process.
The hardware also keeps track of the rows and columns, so the exact position of the object can be located in the image.
The hardware has a three-stage pipeline to increase the throughput.

2. Previous Work
SAD is one of the most popular techniques for motion estimation in digital video encoding systems. It is fast and relatively cheap to implement. Lengwehasatit and Ortega proposed probabilistic partial-distance fast matching algorithms for motion estimation [8]. Their paper proposes a fast matching algorithm to speed up the computation of the matching metric used in the search, that is, the sum of absolute differences (SAD). Wong, Vassiliadis and Cotofana [1] of Delft University of Technology have presented a SAD FPGA based hardware design. Their paper proposes a new hardware unit that performs a 16 x 1 SAD operation. It was shown that the 16 x 1 SAD architecture used can be easily extended to perform the 16 x 16 SAD operation, which is commonly used in many multimedia standards, including MPEG-1 and MPEG-2. Guevorkian, Launiainen, Liuha and Lappalainen [9] proposed efficient architectures for computing SAD in its application to motion estimation in mobile video coding. Higher performance is achieved despite a lower gate count and power consumption. Wei and Zhi Gang [10] presented a novel SAD computing hardware architecture for variable-size block motion estimation, which was then implemented on an FPGA. The proposed architecture, with a 16x1 PE array, a 4-stage adder tree and two flexible register arrays, supports 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4 block SAD calculations. McBader and Lee [11] have designed a programmable parallel architecture that is used for signal pre-processing in intelligent embedded vision systems. The architecture was implemented and tested using a Celoxica RC1000 prototyping platform with a Xilinx XCV2000E FPGA. Ates and Altunbasak [12] showed that, in order to reduce encoder complexity, a hierarchical block matching based motion estimation algorithm that uses a common set of SAD computations for motion estimation of different block sizes can be used.
Based on the hierarchical prediction and the median motion vector predictor of H.264, the algorithm defines a limited set of candidate vectors. The optimal motion vectors for all partitions are chosen from this common set. Ambrosch, Humenberger, Kubinger and Steininger of the Vienna University of Technology [13] designed a SAD based stereo vision real time core algorithm that is used for automotive applications.

3. Hardware Design
The block diagram of our system is given below. The images are taken from a video or digital camera and are the input to the system. The Frame Grabber board is there to convert the input video into digital
frames and send them to the processor for further processing. A dual- or quad-core processor would be the preferred choice for efficient processing. For the sake of record keeping, the processor application may choose to save the frames of the video to the hard disk. The frames are sent to the SDRAM of the FPGA board so as to carry out the SAD and find the required kernel image in the video. The configuration of the SDRAM is done before the video streaming onto the FPGA for the SAD is started. The results, i.e. match or no match, and the result index in the image in the case of a match, are also stored in the SDRAM.
Figure 2: Block diagram of the system

[Diagram: analog data from a video camera enters the frame grabber board; frames pass over the PCI bus to the FPGA and its SDRAM, while the processor, RAM and hard disk (SATA interface) sit on the system bus.]

3.1. Kernel and Configuration Memory Organization
The application software running on the computer configures the FPGA according to the needs of the kernel size, the image size and the required tolerance for a match. Figure 3 shows the on-board memory organization. The SDRAM is divided into 4 major blocks.



Figure 3: FPGA board interface and memory organization

[Diagram: the PCI bus connects through the FPGA's PCI interface and SDRAM interface to the SDRAM, which holds Image frame 0, Image frame 1, the SAD configurations and the SAD results.]

Table 1: SDRAM division on the FPGA board

Image Frame 0 and Image Frame 1
Ping-pong buffers that hold the images before SAD. The two entities accessing them are the software application, which writes the images, and the FPGA state machine, which reads them to apply SAD.

SAD Configuration
Index  Register Name     Size            Comments
1      Image length (M)  16 bits         Length of the image in bytes
2      Image width (N)   16 bits         Width of the image in bytes
3      Kernel size (L)   16 bits         Kernel length and width in bytes; they have to be the same and a multiple of 4
4      Tolerance         16 bits         Percentage error tolerance to declare a match
5      Kick              1 bit           Kick bit to start the SAD algorithm on the video stream
6      Stop              1 bit           Indication by software to stop SAD
7      Kernel filter     L x L x 8 bits  The kernel pixels to be used by the FPGA

SAD Results
Index  Register Name        Size     Comments
1      Match                1 bit    1 in case of a match, 0 in case of a mismatch
2      SAD sum              32 bits  Result of the SAD calculation of the current frame
3      Match row number     16 bits  Row number of the image where a match with the kernel is found
4      Match column number  16 bits  Column number of the image where a match with the kernel is found
5      Frame number         16 bits  Frame number for which the result is generated

The first two blocks are the ping pong blocks, i.e., one of the blocks is being filled by the PCI interface of the computer while the other block is simultaneously being read by the FPGA state machine to carry out the SAD operation. The sizes of these blocks are always the same and are dependent upon the image size that is being used in the video. It is a parameterizable number and is configured in the configuration block of the SDRAM. SAD configuration is the information required by the SAD micro engines to apply SAD according to the user defined variables. The results of SAD are written into the SAD results block of the SDRAM. Table 1 gives the information in all of these four blocks in detail.

3.2 SAD Calculation Micro Engine Components
The SAD micro engine consists of data path components, chiefly multiplexers, full and half adders and logic elements. SAD is composed of 3 major hardware parts, i.e. the absolute difference calculation, the partial product reduction and the final addition using a carry propagate adder. All of these steps are carried out using highly optimized, high-speed hardware components. The SAD equation for a block is given by:

SAD(x, y, i, j) = Σ_{u=0}^{M-1} Σ_{v=0}^{N-1} | A(x+u, y+v) - B(x+i+u, y+j+v) |    (1)

where
M x N is the dimension of the kernel
(x, y) is the location of the current block in the image
(i, j) is the motion vector specifying the kernel shift
3.2.1. Absolute Difference Calculation
SAD calculates the absolute difference between the kernel image and the target image. We can rewrite the SAD equation as [2]:

SAD(x, y, i, j) = Σ_{u=0}^{M-1} Σ_{v=0}^{N-1} ADC_{u,v}    (2)

where

ADC_{u,v} = | X - Y | = { X - Y,  X > Y
                          Y - X,  X < Y
                          0,      X == Y }    (3)
It is noteworthy here that the absolute difference calculation involves the comparison of two unsigned numbers (since we are dealing with gray scale values). Comparison in hardware, however, is an expensive operation. In order to avoid it we utilize another scheme, carrying out the following steps:
Convert the two numbers into signed numbers (they become n+1 bit numbers instead of n bit numbers, since the MSB indicates the sign bit)
Calculate the difference of the two numbers, i.e. X-Y as well as Y-X
If X-Y is a positive number (sign bit is 0) then X>Y, and this difference is used as the absolute difference; if X-Y is negative then X<Y, and Y-X is used as the absolute difference [2]
This is termed Absolute Difference Calculation hardware Scheme 1, and the resulting hardware is shown in Figure 4. Another (comparator-less) scheme for the ADC is based on the fact that taking the two's complement of a complemented signed number returns the original number. First X-Y is calculated, assuming that X>Y. If this is true the MSB of X-Y will be zero; otherwise the two's complement of (X-Y) is taken:

ADC_{u,v} = { X + Y' + 1,          if MSB = 0
              (X + Y' + 1)' + 1,   if MSB = 1 }    (4)

where Y' denotes the bitwise complement of Y. This is termed Absolute Difference Calculation hardware Scheme 2, and the resulting hardware is shown in Figure 5. Scheme 2 is used for the micro engines.
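Scheme 2 can be modelled in software as follows (an illustrative sketch for 8-bit unsigned operands, with our own function name; here the carry-out of the (width+1)-bit addition plays the role of the MSB test):

```python
def adc_scheme2(x, y, width=8):
    """|x - y| without a comparator: compute x + ~y + 1 (two's complement
    subtraction) and, when the extra bit shows the result was negative,
    take the two's complement of the difference."""
    mask = (1 << width) - 1
    diff = (x + (~y & mask) + 1) & ((1 << (width + 1)) - 1)  # x + y' + 1
    if (diff >> width) & 1:            # carry out: x >= y, difference is valid
        return diff & mask
    return ((~diff & mask) + 1) & mask  # y > x: re-complement to get y - x
```

For unsigned operands the carry-out of the addition is 1 exactly when x >= y, which is why it can stand in for the sign test.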



Figure 4: Absolute difference calculation Scheme 1

[Diagram: two 8-bit full adders with carry compute X-Y and Y-X in parallel; the MSB of X-Y drives a multiplexer that selects the positive difference.]

Figure 5: Absolute difference calculation Scheme 2

[Diagram: a single 8-bit full adder with carry computes X-Y; its MSB drives a multiplexer that selects either the raw difference or the difference passed through an 8-bit incrementor after complementing.]

4. Partial Product Reduction


Partial product (PP) reduction [14] is the process of reducing N layers of numbers to be added down to two layers. This reduction is carried out using components such as the full adder (a 3 to 2 compressor) and the half adder (a 2 to 2 compressor). The two resulting layers are generally known as sum and carry. A carry propagate adder is used as a final step to add these two numbers and obtain the result. The name partial product comes from fast array multipliers (capable of single cycle multiplication), in which two N bit wide numbers are multiplied to generate N partial products, which are then reduced using partial product reduction schemes, resulting in 2 layers to be added in a CPA (carry propagate adder).
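The 3 to 2 compression performed by a layer of full adders can be modelled bitwise on whole words (an illustrative sketch; carry_save is our own name):

```python
def carry_save(a, b, c):
    """One level of carry-save addition: compress three addends into a
    (sum, carry) pair whose total equals a + b + c.  Each bit position
    acts as a full adder: sum bit = XOR, carry bit = majority, with the
    carries shifted left into the next column."""
    s = a ^ b ^ c                                 # per-column sum bits
    carry = ((a & b) | (a & c) | (b & c)) << 1    # per-column carry bits
    return s, carry
```

Repeating this level on further addends reduces any number of layers down to two, which a carry propagate adder then combines.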
4.1 Partial Product Reduction Schemes

In order to represent the numbers to be added in a generic form, DOT notation is used [14]. Each dot represents a bit (0 or 1) in the binary numbers to be added. If an isolated dot appears in a column it is simply dropped down to be processed at the next logic level. If 2 dots appear in a column they may be added using a half adder (HA), producing 2 dots: one at the next level in the same column and one in the adjacent column (the HA is therefore called a 2 to 2 compressor). A full adder (FA) is used when there are 3 dots in a column. Many researchers have introduced basic partial product compressors other than the HA and FA [15,16]. Following are brief descriptions of several commonly used partial product reduction schemes.
4.1.1 Carry Save Partial Product Reduction Scheme
According to the carry save partial product reduction scheme, three layers of PPs are reduced to two layers using carry save addition. The following 3 cases are dealt with as follows:
Isolated bits in a column in any of the three layers of PPs are dropped down to the next logic level
Two bits in a column are reduced to two bits using HAs
Three bits in a column are reduced to two bits using FAs
Using this technique the 3 partial products are reduced to 2. The fourth partial product is then combined with them to make 3 partial products, which are again reduced using the carry save addition technique. The process is repeated until the numbers are reduced to two layers. This scheme leaves a few least significant product bits, termed free bits. The rest of the bits appear in two layers, which may then be added using any CPA to get the rest of the product bits [14].
4.1.2. Dual Carry Save Partial Product Reduction Scheme
In the dual carry save reduction scheme, the partial products are divided into 2 equal-size groups. The carry save reduction scheme is applied to both groups simultaneously, resulting in two partial product layers in each group. These four layers are further reduced using two levels of carry save reduction [14].
4.1.3. Wallace Partial Product Reduction Scheme
The Wallace tree reduction method is characterized by grouping the partial products into groups of 3 and reducing all groups simultaneously through carry-save adders (CSA). The reduction is performed in parallel in groups of 3. Thus, as the number of partial products increases, the size of the multiplier increases too. Each adder level incurs one FA delay in the path; hence the greater the number of adder levels, the longer the critical path through the combinational cloud of the PPR unit. Figure 6 shows the total number of full adder levels needed to reduce n partial products to 2 using the Wallace tree reduction scheme [14].
Figure 6: Full Adder Levels in a Wallace tree reduction scheme


4.1.4 Dadda Partial Product Reduction Scheme
The reduction rate of a Dadda tree compressor is identical to that of a Wallace tree compression unit. However, a Dadda tree compression unit results in an optimal number of hardware elements, i.e. full adders and half adders. This directly translates to lower power dissipation and smaller size [17].
4.1.4.1 Dadda Tree Partial Product Reduction Scheme for SAD
Partial product compression is the second important step in SAD calculation. As a result of a 4x4 block ADC we get 16 unsigned numbers, 8 bits wide each, which need to be added to get the final answer. This addition is done in two steps. In the first step, the 16 layers of bits are reduced to 2 layers using the Dadda tree partial product reduction scheme. In the second step, these two layers of bits are added using a carry propagate adder. Figure 7 shows the Dadda tree reduction of 16 layers to 13 layers using as many FAs and HAs as required. The Dadda tree reduction reaches 2 layers over several levels. The details of this compression of levels are given in Table 2.
Figure 7: Dadda tree reduction level 1

Figure 8: Dadda tree reduction level 1 hardware

The first row gives the number of partial products and the next two rows indicate the number of full adders and half adders respectively. We can see that in the first level of Dadda tree compression, the 16 partial products are reduced to 13 layers. They are further reduced to 9 layers in the next level, and so on, until we end up with 2 layers, each 11 bits wide, which are the input to the CPA.

Table 2: Dadda tree reduction levels

Level   Layers in   FAs per column (most columns)   HAs   Layers out
0       16          3                               0-1   13
1       13          4                               0-1   9
2       9           3                               0-1   6
3       6           2                               0-1   4
4       4           1                               0-1   3
5       3           1                               0-1   2

The two final layers feed the CPA. Edge columns use fewer full adders and, where needed, a half adder; the carries arriving from the column to the right make up the remaining bits at each level.
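The sequence of layer counts in Table 2 follows the standard Dadda height recurrence d_{j+1} = floor(3*d_j/2); a small sketch (our own illustration) reproduces it:

```python
def dadda_heights(n):
    """Dadda column-height targets for reducing n partial-product
    layers down to 2: successive values of d_{j+1} = floor(3*d_j/2),
    listed from the first reduction level to the last."""
    seq = [2]
    while seq[-1] < n:
        seq.append(seq[-1] * 3 // 2)   # 2, 3, 4, 6, 9, 13, 19, ...
    return [h for h in reversed(seq) if h < n]
```

For 16 input layers this yields the Table 2 sequence 13, 9, 6, 4, 3, 2.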

5. Carry Propagate Adder


Carry propagate adders are adders that take into account the carry generated by the nth column of partial products in the (n+1)th column. The number of partial product layers incorporated in a carry propagate adder is always 2, one called sum and the other carry. A carry-in may or may not be included at the LSB of the partial products. These adders generate the final sum of the addition of multiple partial products and, from the timing and area perspectives, they are generally a very expensive component in digital design. Some of the latest families of FPGAs contain built-in core components that include signed adders, multipliers, etc. There are several implementation techniques for a carry propagate adder, and they are generally a compromise between area and time. Among the most popular of these techniques, the simplest, as well as the slowest, is the ripple carry adder. Better implementation techniques in terms of time are the carry look-ahead adder and the carry select adder. The following is a brief description of each kind of carry propagate adder, along with a description of the adder used in the SAD micro engine.
5.1. Ripple Carry Adder (RCA)

A ripple carry adder is the simplest and the slowest kind of carry propagate adder. Its working principle is based on the rippling of a carry from one FA to the next until the final full adder is reached. Figure 9 shows a 6-bit ripple carry adder. It works just the way addition of binary numbers is done with pencil and paper. Starting from the rightmost 2 bits (or 3 bits, in the case that a carry-in is also present), a sum bit is generated for that column and the carry is forwarded to the next column. This carry may be a zero or a one. It is passed to the column position on its left and the addition is carried out again, producing a sum bit and a carry bit to be put forward.


Hence, due to the rippling of the carry bit from the rightmost position up to the most significant bit on the left, this adder is called the ripple carry adder. Another noticeable, and undesirable, feature of this adder is that its delay is directly proportional to the width of the adder and increases linearly with it [14].
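The ripple carry behaviour can be modelled one full adder at a time (an illustrative sketch on LSB-first bit lists; the name ripple_carry_add is ours):

```python
def ripple_carry_add(a_bits, b_bits, cin=0):
    """Ripple-carry addition of two equal-length bit lists (LSB first).
    Each stage is a full adder; its carry ripples into the next stage,
    which is why the delay grows linearly with the adder width."""
    sum_bits, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)                    # full adder sum
        carry = (a & b) | (a & carry) | (b & carry)       # full adder carry
    return sum_bits, carry
```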
Figure 9: Six-bit ripple carry adder [14]

5.2. Carry Look Ahead Adder(CLA)

Unlike a ripple carry adder, in which an adder has to wait for the carry generated by the adder at the bit position on its right, the carry look-ahead adder attempts to generate the carries at all bit positions in parallel, hence avoiding the carry propagation wait. The carry C_{i+1} produced at the ith stage is given as follows:

C_{i+1} = x_i y_i + (x_i ⊕ y_i) C_i    (5)

The equation clearly shows that a carry is generated when either of the following two conditions occurs:
A one is generated at that stage (both operand bits are 1)
A one is propagated from the preceding stage (one of the operand bits is 1)
Therefore, letting G_i and P_i denote the generation and propagation at the ith stage, we have:

G_i = x_i y_i    (6)

P_i = x_i ⊕ y_i    (7)

C_{i+1} = G_i + P_i C_i    (8)

Extending this to the higher bits, we get the carry generation and propagation logic for all the bit positions in the adder. These require just a 2-gate-delay implementation, and hence a constant time delay is required to calculate the carries of the adder [14].
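The generate/propagate formulation can be modelled as follows (an illustrative sketch; note that this software loop computes the carries sequentially, whereas the hardware flattens the recurrence into a constant number of gate levels):

```python
def carry_lookahead(a_bits, b_bits, cin=0):
    """Carry look-ahead model on LSB-first bit lists: form g_i = a_i & b_i
    and p_i = a_i ^ b_i for every position, then derive all carries from
    the recurrence c_{i+1} = g_i | (p_i & c_i)."""
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate terms
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate terms
    carries = [cin]
    for gi, pi in zip(g, p):
        carries.append(gi | (pi & carries[-1]))
    sum_bits = [pi ^ ci for pi, ci in zip(p, carries)]  # s_i = p_i ^ c_i
    return sum_bits, carries[-1]
```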
5.3. Carry Select Adder


A carry select adder is a type of carry propagate adder in which the adder bits are broken into groups (equal or unequal in length). These groups of bits are added separately using RCAs or CLAs (discussed above), assuming a carry-in bit of both 0 and 1. These calculations take place simultaneously (in parallel), and once the carry-in bit for a group's LSB is available, the correct sum pertaining to the corresponding carry is selected, along with the carry-out for that stage of the carry select adder. This carry-out selects the correct sum and carry-out pair for the next stage of the carry select adder, and so on.

The delay of such an adder is equal to the delay of the adder used in a single stage plus the delay of the multiplexers used to select the correct answers. A 12-bit carry select adder is used in the SAD micro engine. The 12 bits are divided into 3 groups of 4 bits each. The adder type used for each group of bits is the carry look-ahead adder. The logic diagram of the adder is given below in Figure 10 [14].
Figure 10: 12-bit Carry Select Adder

[Diagram: three 4-bit adder groups, each duplicated with Cin = 0 and Cin = 1; 2-to-1 multiplexers select the correct sum and carry-out of each group once the carry from the previous group is known.]
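The 3-group, 4-bits-per-group organisation of Figure 10 can be modelled as follows (an illustrative sketch; each group is summed for both carry-in values and a multiplexer choice resolves the result):

```python
def carry_select_add(a, b, width=12, group=4):
    """12-bit carry-select addition modelled in 4-bit groups: each group
    is summed twice (carry-in 0 and carry-in 1) as if in parallel; the
    real incoming carry then selects the correct pair, mux-style."""
    mask = (1 << group) - 1
    result, carry = 0, 0
    for shift in range(0, width, group):
        ga, gb = (a >> shift) & mask, (b >> shift) & mask
        t0 = ga + gb              # group sum assuming carry-in = 0
        t1 = ga + gb + 1          # group sum assuming carry-in = 1
        chosen = t1 if carry else t0   # multiplexer driven by real carry
        result |= (chosen & mask) << shift
        carry = chosen >> group        # carry-out feeds the next group
    return result, carry
```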

6. SAD Macro Blocks Using Atomic Blocks


Using the individual SAD calculation units for one byte, we get a 4 x 4 byte SAD calculation atomic block. The 16 SAD calculation units generate a total of 16 values that are added in a carry select adder to give a 12 bit result. The logic diagram of one of the 4 x 4 SAD micro engines is given in Figure 11. If we want to use a kernel of larger size, e.g. 64 x 64, we use copies of the 4 x 4 SAD calculation atomic block. Multiple blocks add parallelism to the hardware realisation. A 16 x 16 array of the atomic micro engines is used and, as a consequence, 256 results are generated, these being the sums of the individual 4 x 4 SAD blocks. To get the SAD for the whole 64 x 64 block, these numbers are added using carry select adders, as described in the previous section. This addition requires 3 pipeline stages so that the throughput is maximized, but at the cost of latency. Using this arrangement of SAD micro engines we can generate a design for the SAD calculation of any size of kernel and image, provided the length and width are multiples of 4.
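The tiling of a larger kernel into 4 x 4 atomic engines can be modelled as follows (an illustrative sketch; sad_4x4 and sad_macro are our own names, and the block sizes are assumed to be multiples of 4):

```python
def sad_4x4(kernel, block):
    """Atomic 4x4 SAD micro-engine model: 16 absolute differences
    accumulated into a single result."""
    return sum(abs(k - b) for kr, br in zip(kernel, block)
                          for k, b in zip(kr, br))

def sad_macro(kernel, block):
    """Macro-block SAD: tile the kernel and image block into 4x4 atomic
    blocks and add the per-tile results, as the adder tree over the
    array of micro-engines does."""
    n = len(kernel)                # square kernel, side a multiple of 4
    total = 0
    for r in range(0, n, 4):
        for c in range(0, n, 4):
            ktile = [row[c:c + 4] for row in kernel[r:r + 4]]
            btile = [row[c:c + 4] for row in block[r:r + 4]]
            total += sad_4x4(ktile, btile)
    return total
```

Because absolute differences are summed independently per tile, the tiled total equals the direct SAD over the whole block.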



Figure 11: A 4x4 SAD Micro Engine

[Diagram: sixteen absolute-difference units (each an 8-bit full adder with carry, a multiplexer and an 8-bit incrementor) feed a partial product reduction of 8x16 bits using the Dadda tree reduction scheme; the resulting sum and carry layers are added in a 12-bit carry look-ahead adder.]

7. Controller
The controller takes care of the entire system of SAD engines. The small SAD engines discussed above are atomic units of the system that work as data path components, acting on the commands of the main controller. The controller takes care of parameters such as the width and the depth of the image on which SAD takes place and the length of the kernel image for the SAD calculation (the kernel image length and width are the same size). These parameters are configurable and are written by the user through software before the user initiates a go signal to the main SAD controller. Hence the following is the sequence of events that takes place:
The user configures the SAD parameter dimensions
The user fills the kernel memory with the kernel values
The user initiates a go signal to the main controller

Figure 12: Flowchart of the algorithm

[Flowchart states: IDLE → SDRAM configuration by software → kick-start SAD → row/column setting → SAD calculations → match? → SDRAM result update; the end of the image or a match starts SAD for the next frame, and setting the STOP bit returns the controller to IDLE.]

SAD calculations are performed on the image; after every calculation the SAD sum is evaluated and a match or non-match is declared. In the case of a match, the next frame is picked and the SAD is started again. If the entire image is scanned and no match is found, the next frame is initiated, the row and column numbers are reset and the whole procedure is started again, until the user stops the SAD algorithm by setting the stop bit. The flow diagram of the controller is shown in Figure 12.
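The controller's scan loop can be modelled behaviourally as follows (an illustrative sketch; the frame handling and names are ours, and a match is declared when the SAD falls within the threshold):

```python
def run_controller(frames, kernel, threshold):
    """Behavioural model of the controller: scan each frame row by
    column; on a match (SAD <= threshold) record the result and move on
    to the next frame, otherwise continue until the frame is exhausted."""
    results = []
    kh = len(kernel)                              # square kernel assumed
    for fnum, image in enumerate(frames):
        match = None
        for r in range(len(image) - kh + 1):      # row/column setting
            for c in range(len(image[0]) - kh + 1):
                block = [row[c:c + kh] for row in image[r:r + kh]]
                s = sum(abs(k - b) for kr, br in zip(kernel, block)
                                   for k, b in zip(kr, br))
                if s <= threshold:                # match declared
                    match = (s, r, c)
                    break
            if match:
                break
        results.append((fnum, match))             # None means no match
    return results
```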

8. Error Tolerance
The SAD calculated between the kernel and the actual image may deviate from the perfect match result for a number of reasons. If the kernel object is found in the image, the sum of absolute differences between the pixel values of the two image kernels should be zero. This, however, is never exactly the case, even if a matching object is found. The reason for the non-zero result may be noise in the
image, slight occlusions blocking the view of the object in the image, gray scale variations, and other factors such as a slight tilt of the object in the plane parallel to the imaging camera, etc. In order to account for the deviation from the perfect SAD result, a certain tolerance is added to allow a positive return from the SAD algorithm. This tolerance is a percentage of the kernel size employed. It is a parameterizable value and can be changed to loosen or tighten the matching criterion. The tolerance value is written by the application software into the FPGA at configuration time and is always a percentage of the kernel size used. A summary of the calculation for a 64 x 64 kernel with a 5% tolerance is shown in Table 3.
Table 3: Error Tolerance Calculation for a 64 x 64 kernel

Error Tolerance Calculation
Kernel length = 64 bytes
Kernel width = 64 bytes
Kernel size = 64 x 64 = 4096 bytes
5% of kernel size = 5% x 4096 = 204.8
Tolerance limit = 255 x 204.8 = 52224 = CC00h
Match if SAD <= 52224

Hence, after adding the results of the 4x4 SAD engines to obtain the answer for the 64 x 64 SAD calculation, a match will be declared if the sum is less than or equal to 52224.
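The figures in Table 3 can be checked directly (an illustrative calculation; tolerance_limit is our own name):

```python
def tolerance_limit(kernel_len, kernel_wid, percent):
    """Match threshold: the tolerated fraction of kernel pixels, each
    allowed the maximum per-pixel difference of 255."""
    kernel_size = kernel_len * kernel_wid            # pixels in kernel
    tolerated_pixels = kernel_size * percent / 100   # e.g. 5% of 4096
    return int(255 * tolerated_pixels)               # SAD match threshold
```

For a 64 x 64 kernel at 5% this reproduces the CC00h limit of Table 3.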

9. Simulation Results
Software was designed to perform the system code verification. We selected car images with different inclination angles to act as filter images for the SAD. For the test images we acquired some coloured road images and later converted them to grayscale using a software script file. The filter image was then inserted at the desired row and column of the road image and the SAD operation was carried out on the resultant file. The following pictures show the grayscale image of the road and the filter (car image). Insertion of the filter into the target road image is done using a filter mask image. The size of the mask image is kept equal to that of the filter image. For all the bit positions where the filter image has zeros, the mask image has ones, and for the rest of the non-zero gray scale values in the filter image the mask has zeros. This mask, when ANDed with the target image, leaves a black space for the non-zero values of the filter file. The logical OR of this image with the filter image then gives us the desired target image with the inserted filter. Noise or occlusion can be added to verify the threshold test.
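The mask-based insertion described above can be modelled per pixel as follows (an illustrative sketch; insert_filter is our own name):

```python
def insert_filter(target, filt, row, col):
    """Insert a grayscale filter image into the target at (row, col)
    using the mask scheme from the text: the mask is all-ones where the
    filter is 0 (preserve target) and all-zeros elsewhere; AND-ing
    clears the filter's footprint, then OR-ing writes the filter in."""
    out = [list(r) for r in target]
    for i, frow in enumerate(filt):
        for j, f in enumerate(frow):
            mask = 0xFF if f == 0 else 0x00   # mask image bit pattern
            out[row + i][col + j] = (out[row + i][col + j] & mask) | f
    return out
```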
Figure 13: Target Road Grayscale Image

Figure 14: Filter Image

Figure 15: Filter Mask Image

Figure 16: Target Road Grayscale Image ANDed with filter mask



Figure 17: Target Road Grayscale Image after insertion of Filter image

9.1. SAD Micro-Engine Verification Results

In order to verify the design and evaluate the performance of the SAD core it was coded in Verilog HDL and thoroughly verified using the ModelSim 5.7G simulator. The initial implementation was a SAD micro engine capable of carrying out a SAD with a 4x4 byte kernel on a 256x256 byte image. The inputs given for verification consisted both of targeted values, to check the corner cases faced when the gray scale pixels of the images reach their extreme values, and of real images in which the kernel was to be found. The following diagram gives the simulation waveform generated for verification of the 4x4 byte SAD micro engine. The first two signals are the clock and reset. The reset used for simulation is synchronous and active high, as is evident from the waveform. The next 16 signals (A11, A12, ..., A44) are the gray scale values of the 4x4 kernel used for the SAD calculation. These are unsigned values, displayed in hexadecimal format in the figure below. The image values for the window under observation are shown with subscript B (B11, B12, ..., B44). Sum and Cout are the 12 bit and 1 bit wide final outputs of the SAD micro engine, respectively. It can be seen that the calculation of a 4x4 byte SAD takes 3 cycles of pipeline latency: one for calculation of the absolute differences, one for the Dadda tree reduction and a final stage for the addition using a carry select adder. The simulation shows that immediately after de-assertion of the reset signal the kernel and the image take on legal values: A11, A12, ..., A44 take on the values 0x00, 0x01, 0x02, 0x03, 0x14, 0x05, while B11, B12, ..., B44 take on the value 0x03, respectively. After 3 clock cycles the calculated SAD result is 0x264. After this initial latency a result is produced on every clock cycle. For the atomic block that computes the SAD of one byte, the hardware architecture and the simulation results are shown below. This block generates A-B or B-A depending upon whether A>B or B>A: both possible results are calculated and the appropriate one is chosen according to the sign bit of the intermediate result.
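A behavioural sketch of this first scheme (a software model with a hypothetical helper name, not the Verilog source) computes both 9-bit differences and uses the sign bit of one intermediate result to select the non-negative one:

```python
# Scheme 1 sketch: compute both A-B and B-A, select by the sign bit.
def abs_diff_scheme1(a, b):
    d1 = (a - b) & 0x1FF          # 9-bit two's-complement A - B
    d2 = (b - a) & 0x1FF          # 9-bit two's-complement B - A
    sign = (d1 >> 8) & 1          # sign bit of the A - B intermediate
    return d2 if sign else d1     # pick whichever result is non-negative

# e.g. abs_diff_scheme1(0x14, 0x03) and abs_diff_scheme1(0x03, 0x14)
# both yield 0x11, the absolute difference.
```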

Figure 18: Verification results of SAD micro engine

Figure 19: Verification results of SAD unit implementation Scheme 1 results

The second implementation of the SAD unit block exploits the fact that the 2s complement of a negative number gives its positive equivalent. Hence A-B is calculated first, assuming A is always greater than B; the result stands if it is a positive number, otherwise it is passed through a 2s complementer to obtain the correct answer.
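This second scheme can likewise be sketched as a software model (hypothetical helper name, assuming 8-bit inputs and a 9-bit intermediate): one subtraction, followed by a conditional 2s complement when the sign bit indicates a negative result.

```python
# Scheme 2 sketch: subtract once, then 2s-complement if negative.
def abs_diff_scheme2(a, b):
    d = (a - b) & 0x1FF           # 9-bit A - B, assuming A >= B
    if (d >> 8) & 1:              # sign bit set: result was negative
        d = (~d + 1) & 0xFF       # 2s complement gives the positive equivalent
    return d & 0xFF
```

Compared with Scheme 1, this trades the second subtractor for an inverter/incrementer on the single result path.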
Figure 20: Verification results of SAD unit implementation Scheme 2 results

9.2. Test Case 1

In this test case the target car image is randomly placed close to the upper left corner of the target image, as shown in Figure 21. The size of the filter image was 70x93 bytes and hence the kernel size is taken to be the same for simulation. The size of the grayscale target image is 426 x 640 bytes. Since the SAD calculation for the kernel starts from the top left corner of the target image and continues from left to right in each row, and then from top to bottom, the image is detected by the SAD quickly and the scanning is stopped. Hence the number of clock cycles and the time taken for this image detection is much less than in the other test cases.
Figure 21: Target Image with filter close to top left image corner

The waveform in the next figure shows the different signals that are part of the SAD architecture. The match bit indicates whether the SAD for the current location results in a match. The two variables im_row and im_col are the row and column numbers of the image position on which the kernel image is placed and being scanned. The register SAD_sum, as its name implies, holds the SAD of the image and the kernel corresponding to the im_row and im_col positions on the image. The registers N_IM_R and N_IM_C contain the number of rows and columns of the target image, respectively, while the registers N_KR_R and N_KR_C contain the number of rows and columns of the kernel image, respectively. Match_thresh is a user defined number that sets the threshold for a match, in case we want to introduce some tolerance into the matching criterion. The simulation signals a match at im_row = 14 and im_col = 16, and the number of clock cycles taken is (14*16) + 3 = 227 (3 for the pipeline latency). For a 133 MHz clock the time taken to detect the match is 1.706766 micro-seconds.
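The raster scan and early termination described here can be captured in a small software reference model (hypothetical helper, mirroring the signal names above): the kernel slides left to right along each row, then top to bottom, and the search stops at the first window whose SAD falls within Match_thresh.

```python
# Sketch of the raster-scan SAD search with early termination on a match.
def sad_scan(image, kernel, match_thresh):
    n_im_r, n_im_c = len(image), len(image[0])       # N_IM_R, N_IM_C
    n_kr_r, n_kr_c = len(kernel), len(kernel[0])     # N_KR_R, N_KR_C
    for im_row in range(n_im_r - n_kr_r + 1):
        for im_col in range(n_im_c - n_kr_c + 1):
            sad_sum = sum(
                abs(image[im_row + i][im_col + j] - kernel[i][j])
                for i in range(n_kr_r) for j in range(n_kr_c))
            if sad_sum <= match_thresh:
                return im_row, im_col, sad_sum       # match: stop scanning
    return None                                      # no match in the image

image = [[0, 0, 0, 0], [0, 7, 8, 0], [0, 9, 6, 0]]
kernel = [[7, 8], [9, 6]]
# sad_scan(image, kernel, 0) finds the exact match at (row 1, col 1).
```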

Figure 22: Simulation Waveform

9.3. Test Case 2

This test case checks the threshold of the SAD match criterion. Here the inserted image in the target image file and the kernel image used during SAD are slightly different from each other as shown in the two images below. If the SAD for the two images comes out to be less than the threshold specified by the user the engine declares a match. Hence the noise and slight rotation of the subject can be tolerated and the subject can be detected.



Figure 23: Filter image with 140 degree rotation

Figure 24: Filter image with 145 degree rotation

The image shown on the left is inserted into the target image at row 1 and column 1, while the image shown on the right is used as the kernel. The SAD does not give a perfect match but does detect a match for a threshold value of 0xd000 (SAD_sum comes out equal to 0xe904 for this case, as shown in the waveform).
Figure 25: Test case 2 simulation results

10. Synthesis Results


The SAD micro engine was synthesized for several families of Xilinx FPGAs. Since the code is written in Verilog, which is platform independent, it can be synthesized for many FPGA platforms. A single 4x4 SAD micro engine is small enough to fit on a small FPGA, e.g. the XC2V80. The synthesis result summary, obtained using Xilinx ISE 7.1i for the targeted device, the Xilinx XC2V1000, is given in Table 4 below.

Table 4: Device utilization summaries
Device Utilization Summary
Selected Device: 2v1000bg575-4

Number of Slices:            375 out of  5120    7%
Number of Slice Flip Flops:  162 out of 10240    1%
Number of 4 input LUTs:      657 out of 10240    6%
Number of bonded IOBs:       270 out of   328   82%
Number of GCLKs:               1 out of    16    6%

The device utilization summary shows that the resources taken by the SAD core are very modest; as described above, multiple copies of the core will generally be used to carry out the SAD of blocks larger than 4x4 in size. The timing results summary given in Table 5 suggests that the design can run at a clock speed of around 133 MHz. Since pipelining has been exploited in the design of the SAD engine, it delivers a result on every clock cycle once the pipeline is full. Multiple copies of the engine provide further parallelism, multiplying the speed still further.
Table 5: Timing summary

Timing Summary
Speed Grade: -4
Minimum period: 7.505 ns (Maximum Frequency: 133.245 MHz)
Minimum input arrival time before clock: 5.557 ns
Maximum required output time after clock: 5.446 ns
Maximum combinational path delay: No path found

11. Speed Calculations for Video


The SAD micro engine is extremely fast, its pipelined architecture increasing the throughput. As an example of the high speed of calculation possible, Table 6 summarises the time taken to carry out the SAD operation for a 64x64 byte kernel applied to a 1024x1024 byte gray scale image.
Table 6: SAD timing calculations

SAD Timing Calculations
Size of image  = 1024 x 1024
Size of kernel = 64 x 64
No of SAD operations required = 1024 x 1024 = 1048576
Micro engine execution speed  = 133245000 Hz = 133.245 MHz
No of full-image SAD searches per second = 133245000/1048576 = 127.0723343

The calculations show that the SAD micro engine infrastructure is capable of performing the SAD operation on all the possible pixel positions in a 1024 x 1024 byte image with a kernel size of 64 x 64 bytes at a rate of 127 frames per second. However this is the worst-case result: a perfect match may be found much earlier in the SAD search, in which case less calculation is necessary. Table 7 gives the time taken by the SAD calculation for the best, average and worst cases of the object image location in the full image array.

Table 7: SAD timing calculations for the best, average and worst cases

SAD Timing Calculations: best, average and worst cases

Best Case (match found in the first kernel frame)
No of SAD operations required = 1
Micro engine execution speed  = 133245000 Hz = 133.245 MHz
Full-image searches per second = 133245000/1 = 133.245 M

Average Case (match found in the middle of the frame)
No of SAD operations required = 512 x 512 = 262144
Micro engine execution speed  = 133245000 Hz = 133.245 MHz
Full-image searches per second = 133245000/262144 = 508.289

Worst Case (no match found)
No of SAD operations required = 1024 x 1024 = 1048576
Micro engine execution speed  = 133245000 Hz = 133.245 MHz
Full-image searches per second = 133245000/1048576 = 127.0723343
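The frame-rate arithmetic behind Tables 6 and 7 can be checked with a few lines (a sketch assuming, as the tables do, one window SAD per clock once the pipeline is full): the search rate is simply the clock frequency divided by the number of window positions evaluated.

```python
# Sketch of the best/average/worst-case frame-rate arithmetic.
CLOCK_HZ = 133_245_000                   # 133.245 MHz from the timing summary

def frames_per_second(windows):
    # One window SAD completes per clock cycle after the pipeline fills.
    return CLOCK_HZ / windows

worst = frames_per_second(1024 * 1024)   # no match found anywhere
average = frames_per_second(512 * 512)   # match found mid-frame
```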

12. Conclusions
This paper has discussed a time efficient implementation of the Sum of Absolute Difference algorithm for gray scale images. The design is scalable, parameterizable and configurable according to the needs of the user, giving considerable flexibility. The current design of the SAD micro engine works at the high speed of 133 MHz, and offers parallelism as well as scalability, resulting in a powerful design. For a kernel size of 64 x 64 bytes a 2-D array of 8 x 8 micro engines is used. To improve speed further we suggest a design using micro engine arrays of greater dimensions that can easily fit into larger FPGAs. Further pipelining of the design will also increase its speed, at the cost of greater latency but with better throughput. Speed can additionally be enhanced by using boards that carry an array of FPGAs together with a high speed connection to the host PC; using more than one such board on the host PC can increase the throughput of the system still further.

References
[1] S. Wong, S. Vassiliadis and S. Cotofana, "A Sum of Absolute Differences Implementation in FPGA Hardware", Euromicro Conference, 2002, pp. 183-188.
[2] L. Yufei, X. Feng and Q. Wang, "A High-Performance Low Cost SAD Architecture for Video Coding", IEEE Transactions on Consumer Electronics, Vol. 53, Issue 2, pp. 535-541.
[3] J. Vanne, E. Aho, T.D. Hamalainen and K. Kuusilinna, "A High-Performance Sum of Absolute Difference Implementation for Motion Estimation", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 16, Issue 7, pp. 876-883.
[4] http://www.newmediarepublic.com/dvideo/compression/adv09.html, visited 5th May, 2009.
[5] http://www.compuphase.com/cmetric.htm, visited 5th May, 2009.
[6] http://en.wikipedia.org/wiki/Cross-correlation, visited 5th May, 2009.
[7] http://en.wikipedia.org/wiki/Automated_fingerprint_identification, visited 5th May, 2009.
[8] K. Lengwehasatit and A. Ortega, "Probabilistic Partial-Distance Fast Matching Algorithms for Motion Estimation", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 2, February 2001.
[9] D. Guevorkian, A. Launiainen, P. Liuha and V. Lappalainen, "Architectures for the Sum of Absolute Differences Operation", IEEE Workshop on Signal Processing Systems, 2002, pp. 57-62.
[10] W. Cao and Z.G. Mao, "A Novel SAD Computing Hardware Architecture for Variable-Size Block Motion Estimation and its Implementation with FPGA", 5th International Conference on ASIC, 2003, pp. 950-953.
[11] S. McBader and P. Lee, "An FPGA Implementation of a Flexible, Parallel Image Processing Architecture Suitable for Embedded Vision Systems", International Parallel and Distributed Processing Symposium, 2003.
[12] H.F. Ates and Y. Altunbasak, "SAD Reuse in Hierarchical Motion Estimation for the H.264 Encoder", IEEE ICASSP, 2005.
[13] K. Ambrosch, M. Humenberger, W. Kubinger and A. Steininger, "Hardware Implementation of an SAD Based Stereo Vision Algorithm", IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), 18-23 June 2007.
[14] L.K. Tan, "High Performance Multiplier and Adder Compilers", Master's Thesis, University of California, Los Angeles, 1992.
[15] H. Mora Mora, J. Mora Pascual, J.L. Sanchez Romero and F. Pujol Lopez, "Partial Product Reduction Based on Look-up Tables", 19th International Conference on VLSI Design (held jointly with the 5th International Conference on Embedded Systems and Design), 3-7 Jan. 2006.
[16] P. Mokrian, G.M. Howard, G. Jullien and M. Ahmadi, "On the Use of 4:2 Compressors for Partial Product Reduction", IEEE CCECE 2003, Vol. 1, 4-7 May 2003, pp. 121-124.
[17] L. Dadda, "Some Schemes for Parallel Multipliers", Alta Frequenza, Vol. 34, Mar. 1965, pp. 349-356.