Takashi Saegusa, Tsutomu Maruyama and Yoshiki Yamaguchi
Systems and Information Engineering, University of Tsukuba
1-1-1 Ten-ou-dai Tsukuba Ibaraki 305-8573 JAPAN
In image processing, FPGAs have shown very high performance in spite of their low operational frequency. This high performance comes from (1) the high parallelism of image processing applications, (2) the high ratio of 8-bit operations, and (3) the large number of internal memory banks on FPGAs, which can be accessed in parallel. Recent microprocessors can execute SIMD instructions on 128-bit data in one clock cycle. Furthermore, these processors support multiple cores and a large cache memory which can hold all image data for each core. In this paper, we compare the performance of FPGAs with that of those processors using three applications in image processing: two-dimensional filters, stereo vision and k-means clustering, and clarify how fast an FPGA is in image processing, and how many hardware resources are required to achieve that performance.

Many applications in image processing have high inherent parallelism, and the data width of many operations is less than 16 bits. FPGAs can execute those operations in parallel by configuring dedicated circuits for each application. The large number of internal memory banks on FPGAs also supports this parallel processing by enabling parallel access to several hundred data items cached in them. Because of this high parallelism, FPGAs show very high performance in image processing in spite of their low operational frequency.
In order to achieve high performance using a hardware platform with a higher operational frequency, graphics processing units (GPUs) have also been used, and have shown very good performance in some applications. However, they were originally designed for a specific sequence of operations, and it is difficult to realize high parallelism in various applications.
Microprocessors have also supported SIMD instructions for parallel processing, and recent processors can execute a SIMD instruction on 128-bit data in one clock cycle. These processors also support multiple cores, and each core can execute SIMD instructions independently. Furthermore, the cache is large enough to store all image data for each core. Because of this progress in the processors, it has become possible to realize very high performance in image processing. However, programming with these SIMD instructions is very tricky, and the performance varies considerably according to programming skill.
We have implemented several applications in image processing on FPGAs, and tried to achieve the highest performance by minimizing the number of operations and memory accesses. The methods used for these designs can also be used for programming with SIMD instructions. In this paper, we try to clarify how fast an FPGA is compared with recent processors with SIMD instructions and multiple cores [9]. We compare the performance using three applications: two-dimensional filters [1], stereo vision [2][3][4], and k-means clustering [5][6][7][8]. In these applications, the performance of an FPGA can be improved by using larger FPGAs. The comparison is discussed from the viewpoint of the problem size, FPGA size and memory bandwidth.

978-1-4244-1961-6/08/$25.00 ©2008 IEEE.

In recent microprocessors such as the Intel Core 2, SIMD instructions on 128-bit data (16 operations on 8-bit data, 8 operations on 16-bit data, 4 operations on 32-bit data, etc.) can be executed in one clock cycle (these SIMD instructions were supported in previous processors, but they took more than one clock cycle). Furthermore, these processors support multiple cores, and a large cache memory which can hold all image data for each core. The maximum parallelism becomes 4×16 in the current version with quad cores. With the very high operational frequency of these processors (3GHz or more), this parallelism will enable very high performance in image processing.
However, programming with these SIMD instructions is very tricky. Sequential parts in the programs dominate the total computation time (Amdahl's law), and we need to reduce those parts very carefully. This programming is very similar to hardware design, because we need to keep all stages of all pipelined circuits busy to realize higher performance. In FPGA design, we also need to minimize the number of accesses to external memory banks. Reducing memory accesses in the same way also helps the processors to realize higher performance.
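The effect of the sequential parts can be illustrated with Amdahl's law. The following is a minimal sketch, not part of the measured programs: the function name `amdahl_speedup` and the parallel fractions are assumptions chosen for illustration, and only the maximum parallelism of 4×16 comes from the text.

```python
def amdahl_speedup(parallel_fraction: float, parallelism: int) -> float:
    """Overall speedup when only `parallel_fraction` of the run time is parallelized."""
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / parallelism)


# 4 cores x 16 eight-bit operations per SIMD instruction, as stated in the text.
MAX_PARALLELISM = 4 * 16

if __name__ == "__main__":
    # The fractions below are illustrative assumptions, not measurements.
    for fraction in (0.50, 0.90, 0.99):
        speedup = amdahl_speedup(fraction, MAX_PARALLELISM)
        print(f"parallel fraction {fraction:.2f}: speedup {speedup:.1f}x")
```

Under these assumptions, even a program whose run time is 99% parallelizable reaches a speedup of only about 39 out of the possible 64, which is why the remaining sequential parts have to be reduced so carefully.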


TWO-DIMENSIONAL FILTERS

Let w be the radius of the filter, and I(x, y) the value of a pixel at (x, y) in the image. Then, the computation for applying a non-separable filter to I(x, y) becomes as follows.

$$S(x, y) = \sum_{dx=-w}^{w} \sum_{dy=-w}^{w} I(x+dx,\, y+dy) \cdot G(dx, dy)$$

where G(dx, dy) is the coefficient on (dx, dy). The number of operations is (2w+1) × (2w+1) for multiply operations, and (2w+1) × (2w+1)-1 for add operations. If G(dx, dy) can be rewritten as Gx(dx)·Gy(dy) (this means that the coefficients in the filter can be decomposed along the x and y axes), the filter is called separable, and S(x, y) can be calculated as follows.

$$S(x, y) = \sum_{dx=-w}^{w} \Big\{ \sum_{dy=-w}^{w} I(x+dx,\, y+dy) \cdot G_y(dy) \Big\} \cdot G_x(dx)$$

This equation means that S(x, y) can be obtained by applying Gy(dy) first to pixels on the same column, and then Gx(dx) to the results. Therefore, the number of operations is (2w+1)+(2w+1) for multiply operations, and 2w+2w for add operations.

Fig.1 shows block diagrams of circuits for non-separable and separable filters (w = 2). In Fig.1 left (non-separable), one pixel value I(x, y) is given to the circuit every clock cycle, and sent to the register array. At the same time, data in the line buffers (the number of line buffers is 2w) are read out in parallel (these data are I(x, y-k), k = 1, ..., 2w), and given to the register array. The read-out data and I(x, y) are written back to the next line buffers for the calculation of I(x, y+1). The read-out data and I(x, y) are held on the register array for 2w+1 clock cycles and multiplied by G(dx, dy). The products are summed up by an adder tree, and S(x, y) is obtained. This circuit is fully pipelined, and can apply the filter to one pixel in one clock cycle (as its throughput). As for separable filters, the outputs of the line buffers and I(x, y) are multiplied by Gy(dy) first, and then summed up. Then, the sums are held on a shift register for 2w+1 clock cycles, and multiplied by Gx(dx). The 2w+1 products are summed up by an adder tree, and S(x, y) is obtained.

Fig. 1. Circuits for non-separable and separable filters

Fig.2 shows an outline of the program for non-separable filters using SIMD instructions. The numbers following the variable names show their data width, and the small number in parentheses after an arrow shows the parallelism of the SIMD instruction. In the program, the values of eight pixels are multiplied by the eight coefficients, and the 2k-th and (2k+1)-th data in the eight products are added in parallel. Then, the four sums are added to four partial sums in one 128 bit data respectively. Finally, the four partial sums are summed up sequentially.

    for each (x, y) {
        sum4×32b ←(4) 0;
        for dy in [y-w, y+w] {
            for each 8 dx in [x-w, x+w] {
                t8×16b ←(8) eight I(x+dx, y+dy);
                u8×16b ←(8) eight coefficients;   // stored in 8×16b wide array
                m8×16b ←(8) t8×16b × u8×16b  &&
                v4×32b ←(4) the 2k-th value + (2k+1)-th value in m8×16b;
                sum4×32b ←(4) sum4×32b + v4×32b;
            }
        }
        sum up the four 32b data in sum4×32b;
    }

Fig. 2. A program for non-separable filter

STEREO VISION

In the stereo vision system, projections of the same location are searched in two images (Ir and Il) taken by two cameras, and the distance to the location is obtained from the disparity. In order to find the projections, a small window centered at a given pixel in Ir is compared with windows in Il on the same line (epipolar restriction) in area-based matching algorithms. The sum of absolute difference (SAD) is widely used to compare the windows because of its simplicity. When SAD is used, the value of d in [0, D-1] which minimizes the following equation is searched.

$$\mathrm{SAD}_{xy}(x, y, d) = \sum_{dx=-w}^{w} \sum_{dy=-w}^{w} \left| I_r(x+dx,\, y+dy) - I_l(x+dx+d,\, y+dy) \right|$$

In this equation, (2w+1) × (2w+1) is the size of the window centered at a pixel Ir(x, y), d is the disparity, and its range D decides how many windows in Il (their centers are Il(x+d, y)) are compared with the window.

Fig.3 shows how to calculate SADxy(x, y, d) efficiently. The right half in Fig.3(A) shows a window in Ir whose center is Ir(x, y). This window is compared with D windows in its target area (whose width is D+2w) in Il (the left half). Suppose that SADxy(x, y, d) have been calculated for all d. Then, the window in Ir is shifted by one pixel along the x axis, and compared with D windows in its target area (Fig.3(B)). Then, SADxy(x+1, y, d) can be calculated from SADxy(x, y, d) as follows.

$$\mathrm{SAD}_{xy}(x+1, y, d) = \mathrm{SAD}_{xy}(x, y, d) + \mathrm{SAD}_{y}(x+1+w, y, d) - \mathrm{SAD}_{y}(x-w, y, d)$$

$$\mathrm{SAD}_{y}(x, y, d) = \sum_{dy=-w}^{w} \left| I_r(x,\, y+dy) - I_l(x+d,\, y+dy) \right|$$
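The recurrence for SADxy can be checked against the direct definition with a short sketch. This is an illustration under stated assumptions, not the circuit or the SIMD program of the paper: the function names, the list-of-rows image layout (`image[y][x]`) and the test images are hypothetical, and the column sums SADy are simply recomputed here instead of being held in FIFOs or memory.

```python
def sad_y(ir, il, x, y, d, w):
    """Column SAD: sum over dy of |Ir(x, y+dy) - Il(x+d, y+dy)|."""
    return sum(abs(ir[y + dy][x] - il[y + dy][x + d]) for dy in range(-w, w + 1))


def sad_xy(ir, il, x, y, d, w):
    """Window SAD computed directly from its definition."""
    return sum(sad_y(ir, il, x + dx, y, d, w) for dx in range(-w, w + 1))


def sad_xy_row(ir, il, y, d, w, x_first, x_last):
    """SADxy for x_first..x_last on one line, using the sliding update."""
    current = sad_xy(ir, il, x_first, y, d, w)
    row = [current]
    for x in range(x_first, x_last):
        # Add the column entering the window, subtract the column leaving it.
        current += sad_y(ir, il, x + 1 + w, y, d, w) - sad_y(ir, il, x - w, y, d, w)
        row.append(current)
    return row


# Small synthetic images (hypothetical data), 5 rows x 12 columns.
IR = [[(3 * r + 5 * c) % 17 for c in range(12)] for r in range(5)]
IL = [[(7 * r + 2 * c) % 13 for c in range(12)] for r in range(5)]

if __name__ == "__main__":
    incremental = sad_xy_row(IR, IL, y=2, d=2, w=1, x_first=1, x_last=8)
    direct = [sad_xy(IR, IL, x, 2, 2, 1) for x in range(1, 9)]
    print(incremental == direct)
```

After the first window, each further SADxy on the line costs one column added and one column subtracted, independent of the window width.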

In Fig.3(B), SAD of the pixels on x+1+w (gray boxes) corresponds to SADy(x+1+w, y, d), and that on x-w corresponds to SADy(x-w, y, d). SAD of other pixels is already calculated in Fig.3(A), and can be reused. Therefore, we only need to calculate SADy(x+1+w, y, d), because SADy(x-w, y, d) was already calculated for SADxy(x-2w, y, d) (and stored in a temporal buffer). When the window is moved to the right end along the x axis, the window is moved back to the left end of the next line. In this case, SADxy(x+1, y+1, d) is calculated in the same way (Fig.3(C)). In Fig.3(C), SAD of the dark gray pixels was already calculated in Fig.3(B). Therefore, by storing SAD for the dark gray pixels during the computation of the previous line, SADy(x+1+w, y+1, d) can be obtained by just adding the absolute difference for Ir(x+1+w, y+1+w) (light gray box) to it.

Fig.3(D) summarizes how to calculate SADxy(x+1, y, d). The following procedures are executed for all d in [0, D-1] in parallel starting from (x=-w, y=-w) (initial values of SADxy(x, y, d) are zero).

1. load D partial SADs stored in the memory (SADs for the dark gray pixels on x+1+w).
2. calculate |Ir(x+1+w, y+w)-Il(x+1+w+d, y+w)|, and add them to the partial SADs (SADy(x+1+w, y, d) are obtained).
3. put SADy(x+1+w, y, d) into the FIFO, and add them to SADxy(x, y, d).
4. get D SADy(x-w, y, d) from the FIFO (they were put when SADxy(x-2w, y, d) were calculated), and subtract them from the values in step 3 (SADxy(x+1, y, d) are obtained).
5. find d which minimizes SADxy(x+1, y, d).
6. calculate |Ir(x-w, y-w)-Il(x-w+d, y-w)|, and subtract them from SADy(x-w, y, d) (SADs for the dark gray pixels on x-w are obtained), and store the SADs in the memory (those SADs are used for the calculation of SADxy(x-2w, y+1, d) on the next line).

The total number of the operations is 2×D for absolute differences, 2×D for add operations, 2×D for subtract operations, and D-1 for comparison in step 5.

Fig. 3. A computation method of the stereo vision

Fig.4 shows an outline of the program for the stereo-vision using SIMD instructions. In Fig.3(D), FIFOs are used to hold SADy temporarily, but in this program, SADy are stored in the memory. As shown in Fig.4, the parallelism of most sentences in the main loop is 8, because the data width of SADxy is 16 bit (w = 3 or 4 in general). First, 16 absolute differences are calculated in parallel for Ir(x+1+w, y+w) and Ir(x-w, y-w). Then, 8 SADxy(x+1, y, d) are calculated in parallel using the first and second 8 of the absolute differences. The 8 SADxy(x+1, y, d) are copied into two words of 4×32b wide buffer (buf[]) by expanding 16b data to 32b. The two words are shifted to left by 8 bit, and d, d+1, ..., d+7 (up to 8 bit) are inserted to the eight 8b fields generated by the shift operations. After calculating all SADxy(x+1, y, d), the four 32 bit data in buf[] are compared (the k-th 32 bit data in buf[i] is compared with only the k-th 32b data in buf[j]), and the four minimums are obtained. Then, the minimum of the four minimums is chosen, and its lowest 8 bit gives the d which minimizes SADxy(x+1, y, d).

K-MEANS CLUSTERING

Given a set S of D-dimensional points, and an integer K, the goal of the k-means clustering is to partition the points into K subsets (Ci (i = 1, ..., K)) so that the following error function is minimized.

$$E = \sum_{i=1}^{K} \sum_{x \in C_i} (x - \mathit{center}_i)^2$$

where $\mathit{center}_i = \sum_{x \in C_i} x \,/\, |C_i|$ (the mean of points in Ci).

Figure 5 shows one iteration of the simple k-means clustering algorithm. First, squared distances to K centers are calculated for each point in the dataset and the minimum of them is chosen (the point belongs to the cluster which gives the minimum distance). Then, new centers are calculated from the points which belong to each cluster. These operations are repeated until no improvement of E is obtained. By applying the k-means clustering algorithm to color images, we can reduce the number of the colors in the images to K while maintaining the quality of the images.
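The simple k-means iteration described above can be sketched in a few lines. This is a plain sequential illustration, not the FPGA circuit or the SIMD program: the names `one_iteration` and `k_means`, the iteration cap, the sample points, and the choice to leave an empty cluster's center unchanged are all assumptions.

```python
def one_iteration(points, centers):
    """Assign each point to its closest center; return (new_centers, error E)."""
    dims = len(centers[0])
    sums = [[0.0] * dims for _ in centers]
    counts = [0] * len(centers)
    error = 0.0
    for point in points:
        dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
        best = min(range(len(centers)), key=dists.__getitem__)
        error += dists[best]          # E accumulates the squared distance
        counts[best] += 1
        for k in range(dims):
            sums[best][k] += point[k]
    new_centers = [
        # Assumption: a cluster that received no points keeps its old center.
        tuple(s / n for s in total) if n else tuple(old)
        for total, n, old in zip(sums, counts, centers)
    ]
    return new_centers, error


def k_means(points, centers, max_iterations=100):
    """Repeat the iteration until E no longer improves."""
    previous_error = float("inf")
    for _ in range(max_iterations):
        new_centers, error = one_iteration(points, centers)
        if error >= previous_error:
            break
        centers, previous_error = new_centers, error
    return centers


if __name__ == "__main__":
    pixels = [(0, 0, 0), (2, 0, 0), (10, 10, 10), (12, 10, 10)]  # hypothetical RGB values
    print(k_means(pixels, [(0, 0, 0), (10, 10, 10)]))
```

For color reduction, each pixel would then be replaced by the center of the cluster it belongs to.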

    SADxy8×16b[D/8] ← {0, ..., 0};   // initialization
    for (y = -w; y < Y+w; y++) {
        for (x = -w; x < X+w; x++) {
            for each 16 d in [0, D-1] {
                Vp16×8b ←(16) |Ir(x+1+w, y+w)-Il(x+1+w+d, y+w)|;
                Vm16×8b ←(16) |Ir(x-w, y-w)-Il(x-w+d, y-w)|;
                for the first and second 8 d in Vp16×8b and Vm16×8b {
                    t8×16b ←(8) the 8 values in Vp16×8b;
                    s8×16b ←(8) 8 partial SADy in mem8×16b[x+w+1][d/8];
                    t8×16b ←(8) t8×16b + s8×16b;          // 8 SADy are obtained
                    mem8×16b[x+w+1][d/8] ←(8) t8×16b;     // store the 8 SADy
                    s8×16b ←(8) 8 SAD in mem8×16b[x-w][d/8];
                    t8×16b ←(8) t8×16b - s8×16b;
                    SADxy8×16b[d/8] ←(8) SADxy8×16b[d/8] + t8×16b;
                                                          // 8 SADxy(x+1, y, d) are obtained
                    t8×16b ←(8) the 8 values in Vm16×8b;
                    s8×16b ←(8) s8×16b - t8×16b;          // 8 partial SADy on x-w
                    mem8×16b[x-w][d/8] ←(8) s8×16b;       // store the 8 partial SADy
                    buf4×32b[d/8×2] ←(4) the 1st four data in SADxy8×16b[d/8];
                    buf4×32b[d/8×2+1] ←(4) the 2nd four data in SADxy8×16b[d/8];
                    shift buf4×32b[d/8×2] and buf4×32b[d/8×2+1] to left by 8b;
                    fill the 8b fields with d, ..., d+7;
                }
            }
            min4×32b ←(4) buf4×32b[0];
            for each data in buf4×32b[k]
                min4×32b ←(4) min{min4×32b, buf4×32b[k]};
            min1×32b ← the minimum of four 32b in min4×32b;
            mind1×8b ← the lowest 8b of min1×32b;
        }
    }

Fig. 4. A program for the stereo-vision

    OneIteration (Point Set S, Centers Z) {
        E ← 0;
        for each (x ∈ S) {
            z ← the closest point in Z to x;
            z.weightCentroid ← z.weightCentroid + x;
            z.count ← z.count + 1;
            E ← E + (x-z)^2;
        }
        for each (z ∈ Z)
            z ← z.weightCentroid / z.count;
        return E;
    }

Fig. 5. One iteration in the simple k-means algorithm

Figure 6 shows a block diagram of a circuit for the simple k-means clustering algorithm for 24-bit full color RGB images. In Fig.6, pixels in one image are stored in three memory banks (m0, m1 and m2), and four pixels in the three memory banks are read out at the same time (data width of four pixels is 24b×4, and the data width of three memory banks is 32b×3) every clock cycle. The four pixels are processed in parallel on the fully pipelined circuit, and the results (cluster numbers for the four pixels) are stored in m3. While processing one image using four memory banks, next image can be downloaded to other memory banks.

In order to find the closest center, squared Euclidean distances to K centers have to be calculated for each pixel. Suppose that the value of a pixel is (xR, xG, xB). Then, its squared Euclidean distance to center_i is

$$d^2 = (x_R - \mathit{center}_{iR})^2 + (x_G - \mathit{center}_{iG})^2 + (x_B - \mathit{center}_{iB})^2$$

and, we need three multipliers to calculate one distance. In Fig.6, 96 × 3 units to calculate squared distance are used, which means that 96 squared Euclidean distances can be calculated in parallel (24 d^2 for each pixel because four pixels are processed in parallel). After processing all pixels, four partial sums stored in internal memory banks are summed up to calculate new cluster centers.

Fig. 6. A circuit for the simple k-means clustering algorithm

Fig.7 shows an outline of the program for the k-means clustering using SIMD instructions. As shown in Fig.7, the parallelism of the sentences in the main loop is 4 or 8. In the program, the distances to eight cluster centers are calculated in parallel for each pixel, because data width of RGB of I(x, y) and the cluster centers is 8b, but the data width of their differences is 9b (signed), and the width of the squares of the differences is 16b (unsigned). The squares are summed up for R, G and B, and the squared distance is obtained. The width of the distance becomes larger than 16b. Therefore, the eight squares are summed up into two variables slow and shigh. We need to find the cluster number of the cluster which is closest to I(x, y). The distances in the two variables are shifted to left by 8b, and the cluster numbers (up to 8b) are inserted to the 8b fields (the data width of the distances is less than 24b). Then, the minimum of (the distances << 8 | cluster number) is searched, and the cluster number for the nearest cluster can be obtained.
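The trick of searching the minimum of (distance << 8 | cluster number) can be shown in isolation. The sketch below is a scalar illustration of the packing only, not the SIMD program; the function name `nearest_cluster` and the sample distances are assumptions.

```python
MAX_SQUARED_DISTANCE = 3 * 255 * 255  # largest d^2 for 8-bit R, G and B values


def nearest_cluster(distances):
    """Return (cluster_number, distance) using one min over packed 32-bit keys."""
    assert len(distances) <= 256, "cluster numbers must fit in 8 bits"
    assert all(0 <= dist < (1 << 24) for dist in distances), "distances must fit in 24 bits"
    # The distance occupies the upper 24 bits, so it decides the comparison;
    # the cluster number in the lowest 8 bits rides along (and breaks ties).
    packed = [(dist << 8) | number for number, dist in enumerate(distances)]
    best = min(packed)
    return best & 0xFF, best >> 8


if __name__ == "__main__":
    print(nearest_cluster([500, 120, 120, 9000]))  # hypothetical squared distances
```

Because the index travels inside the compared value, no separate bookkeeping of "which lane held the minimum" is needed, which is what makes the search easy to express with packed minimum operations. The same packing is used for the disparity d in the stereo-vision program.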

    for each I(x, y) {
        n ← 0;
        for each 8 centers (ck, ..., ck+7) in K centers {
            slow4×32b ←(4) 0;
            shigh4×32b ←(4) 0;
            for r1×8b ∈ {IR(x, y), IG(x, y), IB(x, y)} {
                t8×16b ←(8) {r1×8b, ..., r1×8b};   // copy 8 times
                u8×16b ←(8) the values of R, G or B of the 8 centers;
                                                   // already stored in 8×16b wide array
                t8×16b ←(8) t8×16b - u8×16b;
                t8×16b ←(8) t8×16b × t8×16b;
                slow4×32b ←(4) slow4×32b + the 1st four data in t8×16b;
                shigh4×32b ←(4) shigh4×32b + the 2nd four data in t8×16b;
            }
            shift slow4×32b and shigh4×32b to the left by 8b;
            fill the 8 bit slots with the center numbers (k, ..., k+7);
            mem4×32b[n++] ←(4) slow4×32b;
            mem4×32b[n++] ←(4) shigh4×32b;
        }
        min4×32b ←(4) mem4×32b[0];
        for all slow4×32b and shigh4×32b stored in mem4×32b[]
            min4×32b ←(4) min{min4×32b, mem4×32b[]};
        min1×32b ← the minimum of four 32b in min4×32b;
        cn1×8b ← the lowest 8b of min1×32b;   // cluster number
        error1×32b ← error1×32b + (min1×32b >> 8);
        sumRGB4×32b[cn1×8b] +=(4) {IR(x, y), IG(x, y), IB(x, y)};
        count1×32b[cn1×8b] += 1;
    }
    update cluster centers;

Fig. 7. A program for the k-means clustering

EXPERIMENTAL RESULTS

We have implemented the programs with the SIMD instructions on Intel Core 2 Extreme QX6850 (3GHz, 8MB L2 cache) with 4GB main memory, and compiled them using Intel C++ Compiler 10.0. This processor has quad cores, and the performance can be improved by the multi-thread execution (image data can be easily divided to four subimages, and each of them can be processed by each thread). The performance of the processor by 4 threads is about 3.6, 3.7 and 3.9 times of 1 thread, depending on the application. In the following comparison, the time to download images from main memory is not included, and the performance of the processor is the average of 1000 runs.

Fig.8 compares the performance of the filter programs (separable and non-separable) of the processor and a circuit on Xilinx XC4VLX160 (66MHz) for a 640×480 pixel grayscale image. In Fig.8, (2w+1)×(2w+1) is the size of the filters. The performances of FPGA for separable and non-separable filters are the same and almost constant for all w, because the performances are decided by the input speed of image data (though the circuit size for non-separable filters is almost proportional to w × w). The performance of the processor for separable filters is more than 400 fps even with 1 thread, and faster than FPGA for all w. As for non-separable filters, the performance by 4 threads is faster than FPGA when the filter size is smaller than 7 (the performance is almost proportional to w × w). When the filter size is 15×15, the performance is about 43 fps, which is fast enough for real-time applications, but slower than FPGA (about 217 fps).

Fig. 8. Performance of two-dimensional filters

Fig.9 compares the performances for the stereo vision of a 640×480 pixel grayscale image. In Fig.9, the size of window is 7 × 7. The performance of the processor by 4 threads is faster than 30 fps when D ≤ 224. In XC2V6000, a window in Il can be compared with D = 121 windows in Ir in parallel (this is limited by the number of block RAMs, and 241 in XC4VLX160). When D is larger than 121, 121 windows are compared in the first try, and the rest are in the second try. Therefore, the performances of FPGAs (Xilinx XC2V6000 and XC4VLX160, 66MHz) become stepwise, and XC4VLX160 is two times faster than XC2V6000 because it is two times larger. In XC2V6000, when D is less than 62, two windows in Il can be compared with 61 windows in Ir in parallel respectively. Therefore, the performance gain by the FPGAs looks like a sawtooth wave (only the gain over 4 threads is shown). FPGAs are faster than the processor for all tested D.

Fig. 9. Performance of the stereo-vision

Fig.10 compares the performances of the k-means clustering for an image called monarch (768×512 pixel color image). The performances of the FPGAs (82.7MHz) become stepwise again, because the distances up to 24 clusters can be calculated in parallel on XC2V6000 (48 on XC4VLX160), and for more centers, the same operations are repeated (the performances for k on the same step are not the same in this case because of the iterations in the k-means clustering algorithm), and the performance gain by the FPGA looks like a sawtooth wave (only the gain over 4 threads is shown).

XC4VLX160 is two times faster than XC2V6000. The performance of the processor gradually decreases for larger k. The performance of the processor for k=8 (4 threads) is 16.7 fps, and a bit slower than the real-time requirement (20-30 fps), but the performance for a small image called lena (512×512 pixels) is 37.3 fps for k=8 and faster than the requirement.

Fig. 10. Performance of the k-means clustering algorithm

DISCUSSION AND FUTURE WORKS

We have compared the performance of a processor with SIMD instructions and multi-cores, and two FPGAs using three simple problems in image processing. The performance of the processor with quad cores is fast enough for real-time processing (more than 30 fps) when the image size is small. The performance gain by the FPGAs (over 4 threads) is not so large (5-15 times) (we can not expect drastic gain (more than hundreds) which was possible compared with previous processors with limited SIMD instructions, and single core). However, all the tested problems are preliminary tasks for more sophisticated works. All hardware resources of the processor have to be used to satisfy the real-time requirement, and no resources are left for other works. The performance gain by FPGAs is limited, but large amount of hardware resources are still available on a large FPGA, and we can execute more sophisticated works which take over the task on the FPGA. When we think of the performance improvement of the processors in the near future, real-time applications which can be processed by the processors are still very limited in image processing, and we need FPGAs for practical real-time applications.

The performances of FPGAs are limited by the size of FPGAs, and the memory bandwidth (image data are too large to store on FPGAs). The performance of FPGAs can be improved by dividing an image to sub-images, and processing them in parallel in the same way as the multi-thread execution in the micro processor if the memory throughput allows it. Fig.11 shows the measured and estimated speedup (over 4 threads) for the problems (in the k-means clustering, it takes 6 clock cycles for each pixel when K=256, and one clock cycle when K=48). With an FPGA which is twice as large as XC4VLX160 (the circles in Fig.11), we can expect twice the performance though the required memory throughput also becomes twice.

Fig. 11. Estimated Performances and memory throughput

We have the following issues which have to be considered.

1. We have compared the performances using only three problems.
2. The performances of the programs on the processor are not fully tuned up.
3. In the comparison, power consumption and costs are not considered.

REFERENCES

[1] R. Turney, "Two dimensional linear filtering", http://www.xilinx.
[2] H. Niitsuma, T. Maruyama, "Real-time Generation of Three-Dimensional Motion Fields", FPL 2005, pp. 179-184
[3] A. Darabiha, et. al., "Video-rate stereo depth measurement on programmable hardware", Computer Vision and Pattern Recognition, 2003, pp. 203-210
[4] J. Diaz, E. Ros, F. Pelayo, E.M. Ortigosa, S. Mota, "FPGA-based real-time optical-flow system", IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), Vol.16, Issue.2, 2006, pp. 274-279
[5] T. Saegusa, T. Maruyama, "An FPGA Implementation of K-Means Clustering for Color Images Based on KdTree", FPL 2006, pp. 567-572
[6] T. Saegusa, T. Maruyama, "An FPGA implementation of real-time K-means clustering for color images", Journal of Real-Time Image Processing, Vol.2, No.4, Springer, 2007, pp. 309-318
[7] M. Estlick, M. Leeser, J. Theiler and J. Szymanski, "Algorithmic transformations in the implementation of K-means clustering on reconfigurable hardware", FPGA 2001, pp. 103-110
[8] B. Maliatski, O. Yadid-Pecht, "Hardware-driven adaptive k-means clustering for real-time video imaging", IEEE TCSVT, Vol.15, Issue.1, 2005, pp. 164-166
[9] T. Saegusa, T. Maruyama, Y. Yamaguchi, "How fast is an FPGA in image processing?", IEICE Technical Report, Vol.108, No.48, 2008, pp. 83-88