Professional Documents
Culture Documents
Streaming Multiprocessor
line buffer reg reg reg reg reg
Local Memory
Local Memory
Local Memory
SP
global memory
SP SP SP SP SP x G(2,--1) x G(1,-1) x G(0,1) x G(-1,-1) x G(-2,-1)
line buffer reg reg reg reg reg
networks
SP SP SP SP SP SP ... ... ... ...
x x x x x G(-2,0)
SP SP SP SP SP SP
line buffer reg reg reg reg reg
Input
x ... x ... x ... x ... x G(-2,2)
I(x+w,y+w-1)
Texture Filtering Unit
I(x+w,y+w)
+
L1 cache memory
S(x,y)
Fig. 2. Circuits for non-separable filters
Fig. 1. An overview of Nvidia GTX280
Fig.2 shows a block diagram of a circuit for non sepa-
order to extract the maximum performance of SPs by hid- rable filters when w=2, which is implemented on FPGA. In
ing memory access delay, we need to provide four threads Fig.2, one pixel value I(x+w, y +w) is given to the circuit
for each SP, which are interleaved on the SP. Therefore, every clock cycle, and sent to the register array. At the same
we have to provide 32 threads for each streaming multi- time, data in the line buffers (the number of line buffers is
processor, which amounts to 960 (=240 × 4) threads in to- 2w) are read out in parallel (these data are I(x+w, y +k) k
tal. Streaming multiprocessors are connected to large global = −w+1, ..., w), and given to the register array. The read-
memory through networks. The clock speed of processing out data and I(x+w, y +w) are held on the register array for
units is 1296MHz, and 602MHz for other units. 2w+1 clock cycles and multiplied by G(dx, dy). The prod-
Because of a larger number of cores, the peak perfor- ucts are summed up by an adder tree, and the result becomes
mance of GPU outperforms CPU. However, the program- S(x, y). The read-out data and I(x+w, y +w) are written
ming on GPU is even more tricky, because back to the next line buffers for the calculation of S(x, y +1).
1. access to the local memory from eight SPs in the same This circuit is fully pipelined, and can apply the filter to each
streaming multiprocessor is fast, but access to the global pixel in one clock cycle (as its throughput). The number of
memory is despairingly slow (it takes several hun- operations is (2w+1)×(2w+1) for multiply operations, and
dred clock cycles, though the throughput of the global (2w+1) × (2w+1)−1 for add operations. Using CPU with
memory is fast enough), SIMD instructions, N pixels can be processed in parallel
2. even for the local memory, eight SPs can not access using N cores, and eight I(x+dx, y +dy) can be multiplied
arbitrary addresses in parallel (parallel access to only by G(dx, dy) in parallel in each core. In order to extract
successive addresses (word boundary) is allowed), the maximum performance of GTX280, we need to provide
3. the size of local memory is only 16KB (for 32 threads), at least 960 threads. In this problem, S(x, y) can be com-
4. we have to provide a large number of threads, while puted independently for all pixels. Therefore, by computing
minimizing sequential parts. 960 S(x, y) in parallel (each of them is computed sequen-
This programming is somewhat similar to the hardware de- tially), we can fulfill all pipeline stages of GTX280. This
sign, because we need to make all stages of all pipelined implementation makes it easy to support arbitrary w using
circuits busy by extracting more parallelism, and we also the same program. The number of operations in this compu-
need to minimize the number of memory accesses to off- tation is the same as FPGA and CPU.
chip memory banks in the hardware design.
4. STEREO VISION
3. TWO-DIMENSIONAL FILTERS In the stereo vision system, projections of the same location
Let w be the radius of the filter, and I(x, y) the value of a are searched in two images (Ir and Il ) take by two cameras,
pixel at (x, y) in the image. Then, the computation for ap- and the distance to the location is obtained from the dispar-
plying a non-separable filter to I(x, y) becomes as follows. ity. In order to find the projections, a small window cen-
w
w
tered at a given pixel in Ir is compared with windows in Il
S(x, y) = I(x+dx, y+dy) · G(dx, dy)
dx=−w dy =−w
on the same line (epipolar restriction) in area-based match-
where G(dx, dy) is the coefficient on (dx, dy) ing algorithms. The sum of absolute difference (SAD) is
The computational complexity of this calculation is O(w × widely used to compare the windows because of its simplic-
w). If G(dx, dy) can be rewritten as Gx (dx) · Gy (dy) (this ity. When SAD is used, the value of d in [0, D−1] which
means that the coefficients in the filter can be decomposed minimizes the following equation is searched.
w
w
along the x and y axes), the filter is called separable. The SADxy (x, y, d)= |Ir (x+dx,y+dy)−Il (x+d+dx,y+dy)|
computational complexity of separable filters is O(w), and dx=−w dy =−w
the performance of CPU is faster than FPGA in practical In this equation, (2w+1) × (2w+1) is the size of the win-
range of w [9]. In this paper, we compare the performance dow centered at a pixel Ir (x, y), d is the disparity, and its
of non-separable filters. range D decides how many windows in Il (their centers are
127
SADxy(x-1,k,0) SADxy(x-1,k,1) SADxy(x-1,k,D-1)
Il x+d x+d+1 Ir x x+1
- - SADxy(x,k,0) SADxy(x,k,1) SADxy(x,k,D-1)
- - thread(0, k) SADy(0,k,0) SADy(0,k,1) ... SADy(0,k,D-1)
.....
.....
.....
.....
Sb Sb ... ... ...
Sc Sa y Sc Sa
thread(x-4,k) SADy(x-4,k,0) SADy(x-4,k,1) ... SADy(x-4,k,D-1)
.....
.....
.....
.....
y Sc Sa ... ... ...
128
SADy(x-4,y,d) (A) - (B)
XOneIteration (Point Set S, Centers Z) {
SADy(x-3,y,d) E ← 0;
SADy(x,y,d)
SADy(x-2,y,d)
for each (x ∈ S) {
SADy(x, y+1,d)
SADxy(x,y,d)
SADy(x-1,y,d)
SADxy(x+1,y,d)
SADy(x ,y,d)
SADy(x+1,y,d)
z ← the closest point in Z to x;
SADy(x+2,y,d) z.weightCentroid ← z.weightCentroid+x;
SADy(x+3,y,d) +
SADy(x+4,y,d) z.count ← z.count+1;
E ← E+(x−z)2 ;
SADy(x+5,y,d)
thread(x,y) + thread(x+1,y) thread(x,y) + thread(x,y+1) }
Fig. 5. Optimization methods of the stereo vision on GPU for each (z ∈ Z)
z ← z.weightCentroid/z.count;
In this method, one set of sad y[], sad xy min[] and d min[] return E;
is required. The computational complexity of this method is }
O(X ×Y ×w×D). Fig. 6. One iteration in the simple k-means algorithm
Fig.5 shows a technique to reduce the number of opera-
tions. First, in step 4, 2w+1 SADy in sad y[] are added se- m3 m2
External SRAM banks
m1 m0
quentially. By merging two threads, thread(x, y) and (x+1, y)
as shown in Fig.5(A) (w=3), SADyx (x+1, y, d) can be ob-
cluster number
tained by subtracting SADy (x−3, y, d) from SADxy (x, y, d),
d& d2 95
d& d2 74
d& d2 73
d& d2 72
d& d2 71
d& d2 24
d& d250
d& d249
d& d248
d& d247
d& d226
d& d225
d& d223
2
1
0
d& d2
d& d2
d& d2
........ ........ ........ ........
+1
+1
+1
+1
ΣΣ (x-centeri) 2
Converge?
threads for one streaming multiprocessor to fulfill the pipeline
Div
i x in Ci
stages. Therefore, we can merge up to X/32 threads. Sec-
|Ci|
ΣΣ (x-centeri)
....
....
....
....
....
....
....
....
ond, in step 2, the absolute differences of 2w+1 pixels are
....
....
i x in Ci
x in Ci Sum
x in Ci
|Ci|
Σ (x-centeri )
Next Images
and (x, y +1) as shown in Fig.5(B), SADy (x, y +1, d) can FPGA
PCI Bus
be obtained by subtracting the absolute difference of the External SRAM banks
pixel with ’-’ from SADy (x, y, d) and adding that of the m7 m6 m5 m4
pixel with ’+’ to the result. The size of each array used in Fig. 7. A circuit for the simple k-means clustering algorithm
this method is X. When 512<X ≤1024, two sets of those
arrays can be implemented in one local memory and two the points which belongs to each cluster. These operations
threads can be merged. However, in our experiments, the are repeated until no improvement of E is obtained.
computation method which uses half of the local memory By applying the k-means clustering algorithm to color
to cache pixel data shows better performance, and no local images, we can reduce the number of colors in the images to
memory is left for merging the two threads. Then, the av- K while maintaining the quality of the images. Fig.7 shows
erage number of operations per pixel is (2w+1)×D for ab- a block diagram of a circuit for the simple k-means cluster-
solute differences, (2w+(2w+2×(X/32−1))/(X/32))×D ing algorithm for 24-bit full color RGB images. In Fig.7,
for add/subtract operations, and D−1 for comparison to find pixels in one image are stored in three memory banks (m0,
the minimum, which is 3.5X for absolute differences, 2.05X m1 and m2), and four pixels in the three memory banks are
for add/subtract operations and 1X for comparisons of the read out at the same time (data width of four pixels is 24b×
computation method used for FPGA when w=3 and X =640. 4, and the data width of three memory banks is 32b×3) every
clock cycle. The four pixels are processed in parallel on the
5. K-MEANS CLUSTERING
fully pipelined circuit, and the results (cluster numbers for
Given a set S of D-dimensional points, and an integer K, the four pixels) are stored in m3. After processing all pix-
the goal of the k-means clustering is to partition the points els, four partial sums stored in on-chip memory banks are
into K subsets (Ci (i = 1, K)) so that the following error summed up to calculate new centers. While processing one
function is minimized. image using four memory banks, next image can be down-
K
loaded to other memory banks. In order to find the closest
E= (x−centeri )2
center, squared Euclidean distances to K centers have to be
i=1 x∈Ci
where centeri = x∈Ci x / |Ci | (the mean of points in Ci ). calculated for each pixel. Let (xR , xG , xB ) be the value of
Fig.6 shows one iteration of the simple k-means clustering a pixel. Then, its squared Euclidean distance to centeri is
algorithm. First, squared distances to K centers are calcu- d2 = (xR −centeriR )2 +(xG −centeriG )2 +(xB −centeriB )2
lated for each point in the dataset and the minimum of them and, we need three multipliers to calculate one distance. In
is chosen (the point belongs to the cluster which gives the Fig.7, 96 × 3 units to calculate squared distance are used,
minimum distance). Then, new centers are calculated from which means that 96 squared Euclidean distances can be cal-
129
culated in parallel (24 d2 for each pixel because four pixels 5000 fps
130
1000
600
400 FPGA(XC4VLX160 100MHz)
FPGA(XC4VLX160 100MHz)
300 400
GPU(GTX280)
CPU - 4 threads(Exterme QX6850)
200
GPU(GTX280) non-optimized 200
131