
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 20, NO. 1, JANUARY 2010

FPGA Design and Implementation of a Real-Time


Stereo Vision System
S. Jin, J. Cho, X. D. Pham, K. M. Lee, S.-K. Park, M. Kim, and J. W. Jeon, Member, IEEE

Abstract—Stereo vision is a well-known ranging method because it resembles the basic mechanism of the human eye. However, the computational complexity and large amount of data access make real-time processing of stereo vision challenging because of the inherent instruction cycle delay within conventional computers. In order to solve this problem, the past 20 years of research have focused on the use of dedicated hardware architecture for stereo vision. This paper proposes a fully pipelined stereo vision system providing a dense disparity image with additional sub-pixel accuracy in real-time. The entire stereo vision process, such as rectification, stereo matching, and post-processing, is realized using a single field programmable gate array (FPGA) without the necessity of any external devices. The hardware implementation is more than 230 times faster when compared to a software program operating on a conventional computer, and shows stronger performance over previous hardware-related studies.

Index Terms—Field programmable gate arrays, integrated circuit design, stereo vision, video signal processing.

Manuscript received October 29, 2008; revised January 16, 2009. First version published July 7, 2009; current version published January 7, 2010. This research was performed for the Intelligent Robotics Development Program, one of the 21st Century Frontier Research and Development Programs, funded by the Ministry of Commerce, Industry and Energy, Korea. This paper was recommended by Associate Editor Y.-K. Chen.

S. Jin and J. W. Jeon are with the School of Information and Communication Engineering, Sungkyunkwan University, Suwon, Gyeonggi-do 440-746, Korea (e-mail: coredev@ece.skku.ac.kr; jwjeon@yurim.skku.ac.kr).

J. Cho is with the Department of Computer Science and Engineering, University of California, San Diego, CA 92093-0404 USA (e-mail: jucho@cs.ucsd.edu).

X. D. Pham is with the Information Technology Faculty, Saigon Institute of Technology, Hochiminh City, Vietnam (e-mail: xuanpd@saigontech.edu.vn).

K. M. Lee is with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 151-742, Korea (e-mail: kyoungmu@snu.ac.kr).

S.-K. Park and M. Kim are with the Center for Intelligent Robotics, Korea Institute of Science and Technology, Seoul 136-791, Korea (e-mail: skee@kist.re.kr; munsang@kist.re.kr).

Digital Object Identifier 10.1109/TCSVT.2009.2026831

1051-8215/$26.00 © 2010 IEEE

I. Introduction

STEREO vision is a traditional method for acquiring 3-D information from a stereo image pair. Stereo vision has many advantages over other 3-D sensing methods in terms of safety, undetectable characteristics, cost, operating range, and reliability [1]. For these reasons, stereo vision is widely used in many application areas including intelligent robots, autonomous vehicles, human–computer interfaces, and security and defense applications [2]–[5].

In spite of its usefulness as a range sensor, stereo vision has limitations for real-life applications due to its considerable computational expense [6]. The instruction cycle time delay caused by numerous repetitive operations makes real-time processing of stereo vision difficult on a conventional computer. For example, several seconds are required to execute a medium-sized stereo vision algorithm for a single pair of images on a 1 GHz general-purpose microprocessor [7]. This low frame rate results in limited applicability of stereo vision, especially for real-time applications, because quick decisions must be made based on the vision data [1].

To overcome this limitation, various approaches have been developed since the late 1980s in order to perform the 3-D depth calculation with stereo vision in real-time using hardware-based systems such as digital signal processors (DSP), field programmable gate arrays (FPGA), and application-specific integrated circuits (ASIC). Kimura et al. [8] designed a convolver-based nine-eye stereo machine called SAZAN which generates dense stereo depth maps with 25 stereo disparities on 320 × 240 images at 20 frames per second (f/s). Woodfill et al. [1] developed a DeepSea stereo vision system based on the DeepSea processor, an ASIC, which computes absolute depth at 200 f/s with a 512 × 480 input image pair and 52 stereo disparities. FPGA-based stereo vision systems have also been introduced due to the rapid development of programmable devices. Darabiha et al. [9] developed a phase-based stereo vision design which generates 20 stereo disparities using 256 × 360 pixel images at 30 f/s on four Xilinx XCV2000E FPGAs. Jia et al. [10] developed MSVM-III, which computes trinocular stereopsis using one Xilinx XC2V2000 FPGA. This system runs at approximately 30 f/s with 640 × 480 pixel images within a 64 pixel disparity search range. Even though considerable progress has been made, improvements are still needed because the previously proposed systems have various deficiencies in regard to the stereo matching method, frame rate, scalability, and one-chip integration of pre- and post-processing functions. From an architectural point of view, all of the real-time stereo vision systems developed so far have only partially satisfied these requirements.

This paper proposes a dedicated hardware architecture for real-time stereo vision and integrates it within a single chip to overcome these limitations. The entire stereo vision process, including rectification, census transform, stereo matching, and post-processing, is designed and implemented using a single FPGA.


Regarding the issue of scalability, the proposed system is intensively pipelined and synchronized with a pixel sampling clock. In general, a sequential bottleneck is eliminated when we synchronize the entire system with an individual data element, rather than with the flow of control; as a result, the overall performance improves [11]. In the same manner, for vision-related systems, the throughput of the implemented system increases continuously with the input pixel frequency, as long as the pixel frequency does not exceed the maximum allowed frequency of the design. For this reason, the implemented system can generate disparity images of VGA (640 × 480) resolution at conventional video-frame rates, 30 and 60 f/s, with a pixel clock of 12.2727 and 24.5454 MHz, respectively. Because no other processing clock is used, the implemented system can generate disparity images with the same resolution at the designated frame rate when connected to cameras with different pixel clock frequencies, without requiring adaptation of the internal system design. As a result, the implemented system is flexible with respect to the camera input parameters, such as frame rate and image size.

The remainder of this paper is organized as follows. Section II presents common background information about the stereo vision process, including pre- and post-processing. Section III gives a detailed description of the implementation of the proposed real-time stereo vision system. Section IV discusses and evaluates the experimental performance results, and Section V draws the conclusion.

II. Stereo Vision Background

Stereo vision refers to the problem of determining the 3-D structure of a scene from two or more images taken from distinct viewpoints [13]. In the case of binocular stereo, the depth information for the scene is determined by searching the corresponding pairs for each pixel within the image. Since this search method is based on a pixel-by-pixel comparison, which consumes much computational power, most stereo matching algorithms make assumptions about the camera calibration and epipolar geometry [14].

An image transformation, known as rectification, is applied in order to obtain a pair of rectified images from the original images as in (1)–(2). In the equations, (x, y) and (x′′, y′′) are the coordinates of a pixel in the original images and the rectified images, respectively. To avoid problems such as reference duplication, reverse mapping is used with interpolation. Once the image pair is rectified, 1-D searching along the corresponding line is sufficient to evaluate the disparity [15]–[18]

$$\begin{bmatrix} x'_l \\ y'_l \\ z'_l \end{bmatrix} = H_l^{-1} \begin{bmatrix} x''_l \\ y''_l \\ 1 \end{bmatrix}, \qquad \begin{bmatrix} x'_r \\ y'_r \\ z'_r \end{bmatrix} = H_r^{-1} \begin{bmatrix} x''_r \\ y''_r \\ 1 \end{bmatrix} \tag{1}$$

$$\begin{bmatrix} x_l \\ y_l \end{bmatrix} = \begin{bmatrix} x'_l/z'_l \\ y'_l/z'_l \end{bmatrix}, \qquad \begin{bmatrix} x_r \\ y_r \end{bmatrix} = \begin{bmatrix} x'_r/z'_r \\ y'_r/z'_r \end{bmatrix}. \tag{2}$$

After rectification, stereo matching (given a pixel in one image, find the corresponding pixel in the other image) is applied to solve the correspondence problem. Since pixels with the same intensity value can appear many times within the image, a group of pixels, called a window, is used to compare the corresponding points. Several stereo vision algorithms have been introduced based on the two correspondence methods: local and global. The local methods include block matching, feature matching, and gradient-based optimization, while the global methods include dynamic programming, graph cuts, and belief propagation. Because the searching range of stereo matching can be restricted to one dimension, local methods can be very efficient compared with global methods. Moreover, even though local methods experience difficulties in locally ambiguous regions, they provide acceptable depth information for the scene with the aid of accurate calibration [13]. For this reason, we employed a local stereo method as the cost function of the proposed system. In particular, we used census-based correlation because of its robustness to random noise within a window and its bit-oriented cost computation [19].

The census transform maps the window surrounding the pixel p to a bit vector representing the local information about the pixel p and its neighboring pixels. If the intensity value of a neighboring pixel is less than the intensity value of pixel p, then the corresponding bit is set to 1; otherwise it is set to 0. The dissimilarity between two bit strings can be measured through the Hamming distance, which counts the number of bits that differ between the two bit strings. To compute the correspondence, the sum of these Hamming distances over the correlation window is calculated as in (3), where I′1 and I′2 represent the census transforms of the template window I1 and the candidate window I2

$$\sum_{(u,v)\in W} \mathrm{Hamming}\big(I'_1(u, v),\, I'_2(x + u,\, y + v)\big). \tag{3}$$

The results of stereo matching provide a disparity, which indicates the distance between the two corresponding points. In principle, the disparity should be computed and evaluated at every pixel within the searching range. However, reliable disparity estimations cannot be calculated on surfaces with no texture or repetitive texture when utilizing a local stereo method [20]. In this case, the disparities at multiple locations within one image may point to the same location in the other image, even though each location within one image can be assigned, at most, one disparity. A uniqueness test detects this case by tracking the three smallest matching results, v1, v2, and v3, instead of seeking only the minimum. The pixel has a unique minimum if the minimum, v1, is well separated from v2 and v3, as in (4), where N is an experimental parameter. Usually, the value of N lies between 1.25 and 1.5; we used a value of 1.33 in all our experiments

$$v_2 > N v_1 \quad \text{and} \quad v_3 > N v_1. \tag{4}$$

Occlusions can also mislead the disparity result, since the scene is captured from two different viewpoints and no correlated information exists between the image pair. The left–right consistency check (LR-check) detects occluded points, at which the disparities estimated from the two images are not the inverse of each other [21]. An experimental parameter, T, is compared with Error in (5) and (6) in order to evaluate the disparity result.
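As a concrete illustration of the census cost in (3), the transform and its Hamming-distance sum can be sketched in software. This is our own Python model, not the authors' VHDL: the tiny 3 × 3 windows (the paper uses an 11 × 11 census window and a 15 × 15 correlation window) and all function names are ours.

```python
def census(img, x, y, n=3):
    """n x n census transform at (x, y): each bit is 1 where a neighbor
    is darker than the center pixel (n=3 keeps the demo small; the
    paper's n=11 yields 120-bit vectors)."""
    r = n // 2
    center = img[y][x]
    bits = []
    for v in range(-r, r + 1):
        for u in range(-r, r + 1):
            if (u, v) != (0, 0):
                bits.append(1 if img[y + v][x + u] < center else 0)
    return bits

def hamming(a, b):
    """Number of differing bits -- an XOR plus popcount in hardware."""
    return sum(p != q for p, q in zip(a, b))

def census_cost(left_img, right_img, x, y, d, n=3, m=3):
    """Eq. (3): sum of Hamming distances over an m x m correlation
    window, comparing left pixel (x, y) against right pixel (x-d, y)
    (this disparity sign convention is our assumption)."""
    r = m // 2
    return sum(
        hamming(census(left_img, x + u, y + v, n),
                census(right_img, x - d + u, y + v, n))
        for v in range(-r, r + 1) for u in range(-r, r + 1))

# Tiny demo: `right` is a textured image and `left` is the same texture
# shifted one pixel to the right, so the true disparity is 1.
pat = lambda x, y: (x * 7 + y * 3) % 13   # arbitrary texture
right = [[pat(x, y) for x in range(9)] for y in range(9)]
left = [[pat(x - 1, y) for x in range(9)] for y in range(9)]
```

At the true disparity the accumulated cost is exactly zero, while a wrong disparity yields a positive cost; the hardware performs the same comparison for all candidate disparities in parallel.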


Fig. 1. High-level hardware architecture of the proposed real-time stereo vision system.

Here $x_r$ is the right coordinate, $x'_l$ is the estimated left match at disparity $d^r_{x_r}$, and $d^l_{x'_l}$ is the left-based estimated disparity

$$x'_l = x_r + d^r_{x_r} \tag{5}$$

$$Error = x_r - \big(x'_l + d^l_{x'_l}\big). \tag{6}$$

The sub-pixel estimation adds accuracy to the disparity result using parabola fitting, $f(x) = a + bx + cx^2$. Since the minimum of the parabola is found at $x = -b/2c$, the sub-pixel position x can be obtained as in (7), where $C(d)$ and $C(d^{\pm n})$ are the correlation values at positions $d$ and $d \pm n$, respectively

$$x = d + \frac{7\big(2C(d^{-2}) + C(d^{-1}) - C(d^{+1}) - 2C(d^{+2})\big)}{20\big(2C(d^{-2}) - C(d^{-1}) - 2C(d) - C(d^{+1}) + 2C(d^{+2})\big)}. \tag{7}$$

For enhancing the disparity image, spike removal and sub-pixel estimation can be used, as described in [22] and [26]. Spike removal eliminates small spikes in the disparity map by regarding them as useless information. A spike is locally stable but not large, and has no support from the surrounding surfaces. Since a true surface within the stereo image is commonly part of a larger surface, a surface label, L, can be assigned to each pixel. Equation (8) shows the assignment process for a specific pixel, i, where N(i) is the neighborhood of i and $d_i$ is the disparity of pixel i

$$i \leftarrow L \quad \text{given} \quad j \in N(i),\; j = L,\; |d_j - d_i| \le 1. \tag{8}$$

After the labels are assigned, the number of pixels with a given label, L, is counted. The pixel is regarded as a spike and removed while generating the disparity result if the number of pixels with its label is smaller than a threshold. The threshold usually depends on the size of the image; in our experiments, we used 400 as the threshold value.

III. Hardware Implementation of the Stereo Vision

As described above, the computational complexity of stereo vision is easily calculated, especially with a census transform-based method. Assuming an n × n window for the census transform and an m × m window for correlation with a searching range r, the required operations to compute the disparity are as follows: (r + 1) · m² · (n² − 1) pixel subtractions, r · m² Hamming distance measurements on bit vectors of length n² − 1, and r · m² Hamming distance summations. A number of additional operations are required for pre- and post-processing, as described above. Considering the image size, these operations can be characterized as very repetitive and time-consuming, especially for a conventional computer. This sequential bottleneck, which is caused by many memory references, causes difficulty in meeting real-time performance for the entire stereo vision system.

To achieve real-time performance while handling the stereo vision procedure presented above, we designed a dedicated hardware architecture implemented in an FPGA.
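The operation counts above can be tallied with a short script. This is our own back-of-the-envelope model (function and variable names are ours), evaluated at the paper's operating point of r = 64, a 15 × 15 correlation window, and an 11 × 11 census window.

```python
def ops_per_disparity(r, m, n):
    """Per-pixel operation counts for census-based matching, following
    the formulas in Section III: pixel subtractions for the census
    transforms, Hamming distance evaluations on (n*n - 1)-bit vectors,
    and Hamming distance summations."""
    subtractions = (r + 1) * m * m * (n * n - 1)
    hamming_evals = r * m * m
    summations = r * m * m
    return subtractions, hamming_evals, summations

# The paper's parameters: 64 disparities, 15 x 15 correlation, 11 x 11 census.
subs, hams, sums = ops_per_disparity(64, 15, 11)
```

This comes to 1,755,000 pixel subtractions per output pixel alone; multiplied by 640 × 480 pixels and 30–60 frames per second, the repetitive workload makes the sequential bottleneck on a conventional computer evident.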


Compared with the instruction cycle mechanism of a conventional computer, an FPGA can process data more efficiently via a customized datapath. Fig. 1 describes the overall system architecture. The entire stereo vision process is performed by passing through a single datapath controlled by a controller module. The datapath structure is defined by three modules: rectification, stereo matching, and post-processing.

For the in-depth implementation of the dedicated datapath, the following design decisions were made: intensive use of parallelism and pipelining, single pixel clock synchronization, and elimination of the use of an external image buffer. Because parallelism and pipelining are intrinsic resources of an FPGA, it is important to fully utilize these resources to improve computational performance. To achieve parallelism, we divide each module into a set of simpler functional elements. Then, each functional element is replicated and executed in parallel as much as possible. For example, we place r hamming distance modules to obtain the r hamming distances, instead of repetitively executing a single hamming distance module r times. As stereo matching has many repetitive operations because of the window-based image processing, parallelizing these repetitions can greatly improve overall performance.

Synchronization of all modules with a single pixel clock is another important design decision for achieving good performance and scalability. From the viewpoint of the camera, the frequency of the pixel clock is related to the number of pixels transmitted, which in turn depends on the frame rate and the image resolution. Assuming an imaginary sensor that uses a 12 MHz pixel clock to generate 640 × 480 gray level images at 30 f/s, a 48 MHz pixel clock would theoretically allow this sensor to generate images four times the size of the original at the same frame rate, or images of the same size at four times the original frame rate. That is, the image size and frame rate depend directly on the pixel clock. Therefore, the implemented system becomes flexible with respect to the resolution and frame rate of the camera if we synchronize the system with the pixel clock of the camera.

Once the design is synchronized with the pixel clock, it is important to increase the maximum allowed frequency in order to maximize the throughput of the system. For this reason, the parallelized modules are divided again into a set of pipeline stages, considering the size and complexity of each operation. In addition, an external image buffer is excluded, because fetching data from external memory devices requires more than one clock cycle. Even though this intensive pipelining causes additional use of routing resources, it helps the overall throughput. After the initial pipeline latency, the disparity for the corresponding input is generated in real-time and synchronized with the pixel clock. A detailed description of each module is given below.

A. Rectification Module

As described in Section II, rectification is performed based on the matrices generated when the camera is calibrated. Since the camera calibration is done off-line, the rectification module takes the user input through an external communication port and initiates the rectification process. Fig. 2 shows the procedure for calculating the reverse mapping coordinate, as described in (1)–(2). The resulting coordinate can be larger or smaller than the reference coordinate due to the characteristics of reverse mapping. However, storing the entire image is wasteful, both in terms of memory usage and latency, because the common assumption about the camera alignment is already made in binocular stereopsis. For this reason, we assume that the pixels are adjusted vertically within ±50 scan-lines at maximum. We use dual-port memory as a circular line buffer to minimize waste while compensating for the coordinate difference. In addition, the reading point of time, Ts, in (9) is determined by measuring the maximum and minimum reference address values to maximize memory utilization.

Fig. 2. Realization of the rectification process using reverse mapping.

In (9), Bu and Bl represent the upper and lower boundary values of the pixel adjustment range; y′′ is the output pixel coordinate and H(y′′) is its reference pixel coordinate; Ln is the maximum number of lines that can be stored in the line buffer; and Ih and Iw represent the image height and width, respectively. Since Ts is constant as long as the rectification matrix is constant, Ts is calculated off-line and transferred to the system during the configuration of the rectification matrix after calibration

$$B_u = \max\left(\max_{y''=0}^{I_h}\big(y'' - H_l(y'')\big),\; \max_{y''=0}^{I_h}\big(y'' - H_r(y'')\big)\right)$$

$$B_l = \max\left(\left|\min_{y''=0}^{I_h}\big(y'' - H_l(y'')\big)\right|,\; \left|\min_{y''=0}^{I_h}\big(y'' - H_r(y'')\big)\right|\right)$$

$$T_s = \left(B_l + \frac{L_n - (B_u + B_l)}{2}\right) \times I_w. \tag{9}$$

After the initial latency Ts, rectification using backward mapping can be performed in synchronization with the pixel input, using the pixel coordinates of the reference image (x, y), which are calculated by the rectification process from the incremental output pixel coordinates (x′′, y′′). Based on the calculated coordinate, the pixel output for the rectified image is read from the line buffer and sent to the stereo matching module. As a result, the stereo matching module is guaranteed to receive a rectified image input, which satisfies the epipolar geometry.


Fig. 3. Hardware architecture for stereo matching (census transform/correlation).

Fig. 4. Hardware architecture for the parallel processing of window pixels.

Fig. 5. Hamming distance computation using the adder tree.

B. Stereo Matching Module

After the rectification process, stereo matching is performed between the two corresponding images. As shown in Fig. 3, stereo matching is divided into two stages, the census transform stage and the correlation stage. In the census transform stage, the left and right images are transformed into images with census vector pixel values instead of gray-level intensity pixel values. We use an 11 × 11 window for the census transform to generate a 120 bit transformed bit vector for each pixel. Since the census transform is a type of window-based processing, the neighboring pixels of the pixel being processed need to be accessed simultaneously. To achieve this, we use a scan-line buffer and a window buffer. A scan-line buffer is a buffer which can contain several scan-lines of the input image, and a window buffer is a set of shift registers which contains the pixels in the window. Since the window buffer consists of registers, it guarantees instant access to its elements.

The scan-line buffer used in the proposed system consists of 10 dual-port memories, and each memory can store one scan-line of an input image. Assuming that the coordinates of the current input pixel are (x, y) and the intensity value of the pixel is I(x, y), the connections between the memories are shown in Fig. 4. Fig. 4 also shows how the scan-line buffer conceptually converts a single pixel input into a pixel column vector output. A window buffer is a square-shaped set of shift registers, and each register can store one pixel intensity value of the input image. The right-most column of the window buffer acquires the pixel column vector generated by the scan-line buffer, and the intensity values in each register are shifted from right to left at each pixel clock. As a result, all 11 × 11 pixels exist in the window buffer, and each pixel can be accessed simultaneously.
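The scan-line-buffer-plus-window-buffer datapath can be mirrored behaviorally in software: one pixel enters per clock, each of the n − 1 line memories delays the stream by one scan-line, and an n × n register window shifts from right to left. The class and method names below are ours, and the toy width of 5 stands in for the paper's 640-pixel lines with n = 11.

```python
from collections import deque

class WindowBuffer:
    """Behavioral model of the scan-line buffer (n - 1 one-line delays)
    feeding an n x n shift-register window."""
    def __init__(self, width, n):
        self.width = width
        self.lines = [deque() for _ in range(n - 1)]                 # scan-line delays
        self.window = [deque([0] * n, maxlen=n) for _ in range(n)]   # shift registers

    def push(self, pixel):
        """One pixel clock: shift in a new column, return the window."""
        column = [pixel]          # newest scan line first
        carry = pixel
        for line in self.lines:
            line.append(carry)
            # A full line delays its input by exactly `width` clocks;
            # zeros model the reset state during warm-up.
            carry = line.popleft() if len(line) > self.width else 0
            column.append(carry)
        for row, value in zip(self.window, column):
            row.append(value)     # maxlen=n silently drops the oldest column
        return [list(row) for row in self.window]
```

After the priming latency, every clock yields one complete window, which is what lets the census transform read all n × n pixels in a single cycle.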


Fig. 6. Hardware architecture of the LR-check module.

Fig. 7. Realization of the sub-pixel estimation using parabola fitting method.

Fig. 8. Hardware architecture of the spike removal module.


TABLE I
Device Utilization Summary

                                        Used      Available   Utilization
Logic utilization
  Number of slice flip flops            53,616    178,176     30%
  Number of four input LUTs             60,598    178,176     34%
Logic distribution
  Number of occupied slices             51,191    89,088      57%
    Rectification                        4,035                4%
    Census transform                     7,076                8%
    Hamming distance                     3,792                4%
    Correlation                         24,526                27%
    Post-processing                     10,744                12%
Number of FIFO/RAMBs                       322    336         95%
    Rectification                           64                19%
    Census transform                        10                2%
    Correlation                            182                54%
    Post-processing                         66                17%
Number of DSP48s                            12    96          12%
Number of DCM-ADVs                           3    12          25%

Fig. 9. Implemented real-time stereo vision system based on FPGA.

After the pixel intensities have been assigned to the registers within the window buffer, the census vector can be obtained by comparing the intensity of the centered pixel, Win(xc, yc), with its neighboring pixels in the window. As a result, an image whose pixels carry local information about the relation between each pixel and its neighborhood is generated and delivered to the correlation stage.

The correlation stage evaluates the correlation between the census vectors generated by the left and right census transforms using a window-based method, similar to the one used in the census transform stage. By gathering the census vectors within a 15 × 15 window, we can build a more robust representative of each pixel contained within the image. In the correlation stage, 64 hamming distances, for the preset range r = 64, are evaluated between a template window for a pixel in the left image and the corresponding 64 correlation windows for pixels in the right image. After the comparison, the two pairs with the shortest hamming distances are used to define the resulting disparity. Since the windows being compared can be regarded as bit vectors, it is possible to obtain the hamming distance by counting the '1's in the vector after applying an XOR operation.

In order to decide upon the disparity result, the template window in the left image is compared with all 64 candidate windows from the right image. Since the size of the window is 15 × 15 and each element of the window is a census vector, the hamming distance between any two windows can be obtained as shown in Fig. 5. First, the census vector from the census transform module is delayed for 64 pixel clocks. Next, the distance between any two census vectors is calculated instead of building a correlation window. Since the hamming distance between two pixels can range from 0 to 120, the result is represented by 7 bits and regarded as the input of the correlation window. The hamming distances are obtained by accumulating all the pixel values in the correlation window, similar to the census transform stage. We can use the tournament selection method to find the shortest distance among these 64 hamming distances. The candidate window which has the shortest distance from the template window is selected as the closest match, and the coordinate difference of the selected windows along the x-axis is extracted as the disparity result.

C. Post-Processing Module

The post-processing module contains four sub-modules that increase the accuracy of the resultant disparity images: LR-check, uniqueness test, spike removal, and sub-pixel estimation. As described in Section II, the LR-check is required to remove mismatches caused by occlusion. Since the left-to-right matching has been completed, only the search to determine the right-to-left disparities has to be performed, by reusing the matching results [23], [27]. For example, the stereo matching is performed using the pixel in the left image, pl(x, y), and the set of pixels in the right image, pr(x, y)–pr(x + 64, y), at a given time t. At the next time t + 1, the matching is performed using pl(x + 1, y) and pr(x + 1, y)–pr(x + 65, y), and so on. For this reason, the matching result of pr(x, y) with pl(x − 64, y)–pl(x, y) is available at t. As shown in Fig. 6, the disparity values are aligned vertically when compared to the reference image direction. Conversely, the disparity values for a pixel within the other image are aligned in a diagonal direction.

In the proposed system, the possible pixel range is ±64 due to the disparity range of 0–63. For this reason, 64 dual-port memories, each of which can store 128 hamming distance values, are used to perform the LR-check. The address for the output port of each dual-port memory is obtained based on the disparity value found in the stereo matching. During each pixel clock cycle, the outputs of the dual-port memories are compared with each other in order to determine the shortest hamming distance and its index. By comparing the two disparities, one from the vertical direction and the other from the diagonal direction, we can decide whether a disparity passes the LR-check. Fig. 6 shows the block diagram of the LR-check module.

The uniqueness test is also performed to validate the disparity result generated by the stereo matching module. In order to evaluate the uniqueness of the disparity at each position, we need to keep track of the three lowest ranking hamming distances, C(di), C(di − 1), and C(di + 1), where C(di) is the hamming distance at the selected disparity result.


TABLE II
Performance Summary of Reported Stereo Vision Systems

System (platform)                  Image size   Matching method   Disparity range   Window size                        Rectification   Frames per second
SAZAN (FPGA)                       320 × 240    SSAD              25                13 × 9                             No              20
Kuhn et al. (ASIC)                 256 × 192    SSD/Census        25                Census (3 × 3), Corr (10 × 3)      No              50
Darabiha et al. (FPGA)             256 × 360    Local weighted    20                N/A                                No              30
MSVM-III (FPGA)                    640 × 480    SSAD/Trinocular   64                N/A                                Hard-wired      30
DeepSea (ASIC)                     512 × 480    Census            52                Census (N/A), Corr (N/A)           Firmware        200
Software program (3.2 GHz, SSE2)   640 × 480    Census            64                Census (11 × 11), Corr (15 × 15)   Software        1.1
This paper (FPGA)                  640 × 480    Census            64                Census (11 × 11), Corr (15 × 15)   Hard-wired      230

TABLE III
Evaluation Result of the Proposed System Using Middlebury Stereo-Pairs

(The stereo-pair, ground truth, disparity, and bad-pixel columns of this table are images; only the numeric error rates, for δd = 1.0, survive extraction.)

Stereo-pair 1: nonocc: 9.79,  all: 11.56, disc: 20.29
Stereo-pair 2: nonocc: 3.59,  all: 5.27,  disc: 36.82
Stereo-pair 3: nonocc: 12.50, all: 21.50, disc: 30.57
Stereo-pair 4: nonocc: 7.34,  all: 17.58, disc: 21.01

Average percent of bad pixels: 17.24

Following the simple comparison described in Section II, we can check whether the selected disparity is a unique minimum, a double minimum, or a non-unique minimum [22], [24].

If a disparity result passes the LR-check and uniqueness test, sub-pixel estimation and spike removal are applied to increase the reliability and accuracy. The sub-pixel estimation again utilizes C(di), C(di − 1), and C(di + 1), which were tracked in the uniqueness test module, with C(di − 2) and C(di + 2) giving additional precision. By using the parabola fitting method in (7), the sub-pixel estimation is performed with ease. To decrease the complexity and latency of the operations, a combination of shifting, addition, and subtraction is used for each multiplication and division, as shown in Fig. 7.
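The parabola fit of (7) uses only constant integer scalings of the five tracked costs, which is what makes the shift-and-add realization possible. Below is a hedged Python sketch using the coefficients exactly as printed in (7); the function name and the dict-based cost container are ours.

```python
def subpixel(d, C):
    """Sub-pixel refinement around the integer disparity d, using the
    five correlation values C[d-2] .. C[d+2] with the coefficients of
    (7) as printed in the paper."""
    num = 7 * (2 * C[d - 2] + C[d - 1] - C[d + 1] - 2 * C[d + 2])
    den = 20 * (2 * C[d - 2] - C[d - 1] - 2 * C[d] - C[d + 1] + 2 * C[d + 2])
    return d + (num / den if den else 0.0)

# A cost curve symmetric about d gives a zero offset; a curve whose
# true minimum lies slightly to the right gives a positive offset.
sym = {k: (k - 5) ** 2 for k in range(3, 8)}
skew = {k: (k - 5.3) ** 2 for k in range(3, 8)}
```

In hardware, the constant multiplications by 7 and 20 and the final division are decomposed into shifts, additions, and subtractions, avoiding a general-purpose multiplier in this stage.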


Fig. 10. Resultant disparity image with a different post-processing level. Images are captured online under the common indoor scenery. (a) Raw image (Left),
(b) disparity result, (c) LR-check, uniqueness test, sub-pixel, and (d) final disparity with spike removal.

Fig. 7. After a total of 23 clock cycles of latency, the calculated sub-pixel value is obtained and synchronized with the pixel clock.

Spike removal is implemented based on the window processing architecture and is described as part of the stereo matching module. Starting from the label “1,” the disparity output generated by the stereo matching module is compared with its neighboring disparity values and labeled following (8). Generally, a spike occupies only a small region and does not have concave behavior. For this reason, we can skip the sorting of the equivalent label pairs to simplify the removal process.

Fig. 8 shows a block diagram of the implemented spike removal module. While the number of disparities with identical labels is counted, each disparity value is stored in the line buffer because the vertical size of the region must be considered. If the non-spike condition is satisfied, the resultant disparity value is delivered to the system output, including the latency caused by the line buffer. Otherwise, the disparity value is replaced with a blank.

IV. Experimental Results

The proposed real-time stereo vision system was designed and coded in VHDL and implemented on a Virtex-4 XC4VLX200-10 FPGA from Xilinx. Fig. 9 shows the implemented system, which interfaces two VCC-8350CL cameras from the CIS Corporation through the standard Camera Link format, or directly controls an MT9M112 CMOS sensor from Micron as a stereo camera pair. Table I summarizes the device utilization reports from the Xilinx synthesis tool (ISE release 9.1.03i, XST J.33). The number of used slices and the maximum allowed frequency of the proposed system are 51,191 (about 57% of the device) and 93.0907 MHz, as shown in Table I. Since the system is based on local pixel processing, the size of the input image does not affect the device utilization directly. However, both the disparity range and the window size must be adjusted to maintain the resulting disparity quality if the size of the image is changed. Among the sub-modules listed in Table I, the logic resources consumed by the census transform and correlation modules increase linearly as the disparity range and window size increase. The other modules of the system are relatively less affected.

The performance verification of the proposed system was completed using cameras with different frame rates. Both the 30 and 60 f/s cameras passed the working test at just under 12.2727 and 24.5454 MHz, respectively. In the same manner, we can expect the maximum performance of the proposed system to occur with a 230 f/s camera, especially when considering the under-estimation characteristic of the synthesis tool [12]. Since the maximum frame rate of the cameras available for this research is only 60 f/s, the performance of the proposed system was verified through timing simulation, using Mentor Graphics ModelSim 6.1f as the simulation environment with test vectors that describe the actual behavior of the camera.

The software program corresponding to the proposed system was implemented in ANSI C for performance and disparity quality comparison. Several optimization methods and processor-specific functions, such as the multimedia extensions (MMX) and the streaming single instruction multiple data extensions 2 (SSE2), are considered for the full utilization of the hardware resources [25]. The experimental environment for the software program was as follows: Microsoft Visual C++ 7.0, Microsoft Windows XP SP2 Professional, Pentium 4 3.2 GHz, and 2 GB DDR SDRAM. Table II shows the performance comparison result. As shown in Table II, the proposed system, even though it is implemented in an FPGA, shows improvements over the previous systems in both performance and integration.

Table III shows the evaluation results of the cost function used in the proposed system. We measured the percentage of bad pixels on the Middlebury stereo-pairs, which consist of Tsukuba, Venus, Teddy, and Cones [26]. The percentage of bad pixels was obtained as in (10), where dC(x, y) and dT(x, y) are the computed disparity map and the ground truth map, respectively, N is the total number of pixels, and δd is the disparity error tolerance. We used δd = 1.0 for further comparison with the state-of-the-art stereo vision algorithms on the Middlebury stereo evaluation page

B = (1/N) Σ_(x,y) (|dC(x, y) − dT(x, y)| > δd).    (10)

Since the hardware was built for real-time processing of an incoming image, the disparity results of the proposed design were generated through HDL functional simulation,

24 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 20, NO. 1, JANUARY 2010

Fig. 11. Online results of the proposed stereo vision system. (a) Raw image (left), (b) disparity result (proposed), (c) disparity result (Bumblebee), (d) raw image (left), (e) disparity result (proposed), and (f) disparity result (Bumblebee).
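The spike removal described earlier (label connected regions of similar disparity, count their size, and blank small ones) can be approximated offline with a simple flood fill. This sketch is an illustrative software analogue, not the paper's streaming line-buffer implementation; the min_region and tol parameters are hypothetical.

```python
def remove_spikes(disp, min_region=20, tol=1):
    """Blank out (-1) connected regions of similar disparity that are
    smaller than min_region pixels. Offline flood-fill analogue of the
    hardware's streaming labeler; parameters are illustrative."""
    h, w = len(disp), len(disp[0])
    seen = [[False] * w for _ in range(h)]
    out = [row[:] for row in disp]
    for y in range(h):
        for x in range(w):
            if seen[y][x]:
                continue
            # grow the 4-connected region of disparities within tol of the seed
            seed = disp[y][x]
            stack, region = [(y, x)], []
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] \
                            and abs(disp[ny][nx] - seed) <= tol:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(region) < min_region:
                for cy, cx in region:
                    out[cy][cx] = -1  # blank, as in the hardware
    return out

# A single-pixel spike (9) inside a uniform region of 5s gets blanked.
spiky = [[5, 5, 5, 5], [5, 9, 5, 5], [5, 5, 5, 5], [5, 5, 5, 5]]
print(remove_spikes(spiky, min_region=2, tol=1)[1])  # [5, -1, 5, 5]
```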

using the disparity search range limitation for each stereo-pair. As shown in the figure, the proposed system provided acceptable disparity accuracy, considering its real-time performance.

Fig. 10 shows the resultant disparity images captured in a common indoor environment. The images were processed in real-time and obtained online from the implemented system at different post-processing levels. As the post-processing level advances, uncertain pixels are removed to increase the confidence of the disparity result. Because the disparity ground truth of a common scene is difficult to obtain in real-time, a qualitative comparison was performed against a commercial product, the Digiclops stereo vision software with a Bumblebee camera from Point Grey Research, Inc. (PGR). To compare the disparity results under a similar configuration, the Digiclops stereo vision system was configured with a disparity range of 0–63 and a 21 × 21 stereo mask. As shown in Fig. 11, the proposed system generated disparity results comparable to those of the commercial product, with real-time performance.

V. Conclusion

In this paper, we have built an FPGA-based stereo vision system using the census transform, which can provide dense disparity information with additional sub-pixel accuracy in real-time. The proposed system was implemented within a single FPGA, including all the pre- and post-processing functions such as rectification, the LR-check, and the uniqueness test. The real-time performance of the system was verified in a common real-life environment to measure its further applicability.

To achieve the targeted performance and flexibility, we focused on the intensive use of pipelining and modularization while designing the dedicated datapath of the proposed stereo vision system. We made all the functional elements operate in synch with a single pixel clock, as well as utilize the same format of video signals as their input and output. Because the video signals define the format of the image, each module can be considered an independent imaging sensor. As a result, the flexibility of the entire system, as well as of the individual modules, is increased with respect to resolution and frame rate. In addition, we can expect a performance enhancement since the proposed system is designed to be synchronized on the individual pixel data flow, which eliminates the sequential bottleneck caused by instruction cycle delay.

In the future, the proposed real-time stereo vision system will be implemented as an ASIC for commercial use. Considering the efficiency of implementing gate logic within an ASIC, we can expect enhancements in both quality and speed. The proposed system can be used for higher-level vision applications such as intelligent robots, surveillance, automotive systems, and navigation. The additional application areas in which the proposed stereo vision system can be used will continue to be evaluated and explored.

References

[1] J. I. Woodfill, G. Gordon, and R. Buck, “Tyzx DeepSea high speed stereo vision system,” in Proc. IEEE Comput. Soc. Workshop Real-Time 3-D Sensors Use Conf. Comput. Vision Pattern Recog., Washington D.C., Jun. 2004, pp. 41–46.
[2] M. Bertozzi and A. Broggi, “GOLD: A parallel real-time stereo vision system for generic obstacle and lane detection,” IEEE Trans. Image Process., vol. 7, no. 1, pp. 62–81, Jan. 1998.
[3] A. Bensrhair, M. Bertozzi, A. Broggi, A. Fascioli, S. Mousset, and G. Toulminet, “Stereo vision-based feature extraction for vehicle detection,” in Proc. IEEE Intell. Vehicle Symp., Paris, France, vol. 2, Jun. 2002, pp. 465–470.
[4] N. Uchida, T. Shibahara, T. Aoki, H. Nakajima, and K. Kobayashi, “3-D face recognition using passive stereo vision,” in Proc. IEEE Int. Conf. Image Process., Genoa, Italy, vol. 2, Sep. 2005, pp. 950–953.
[5] D. Murray and C. Jennings, “Stereo vision-based mapping and navigation for mobile robots,” in Proc. IEEE Int. Conf. Robotics Autom., Albuquerque, NM, vol. 2, Apr. 1997, pp. 1694–1699.
[6] J. Woodfill, B. von Herzen, and R. Zabih, “Frame-rate robust stereo on a PCI board” [Online]. Available: http://www.cs.cornell.edu/rdz/Papers/Archive/fpga.pdf


[7] M. Kuhn, S. Moser, O. Isler, F. K. Gurkaynak, A. Burg, N. Felber, H. Kaeslin, and W. Fichtner, “Efficient ASIC implementation of a real-time depth mapping stereo vision system,” in Proc. IEEE Int. Circuits Syst. (MWSCAS ’03), Cairo, Egypt, vol. 3, Dec. 2003, pp. 1478–1481.
[8] S. Kimura, T. Shinbo, H. Yamaguchi, E. Kawamura, and K. Nakano, “A convolver-based real-time stereo machine (SAZAN),” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Fort Collins, CO, vol. 1, Jun. 1999, pp. 457–463.
[9] A. Darabiha, J. Rose, and W. J. Maclean, “Video-rate stereo depth measurement on programmable hardware,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Madison, WI, vol. 1, Jun. 2003, pp. 203–210.
[10] Y. Jia, X. Zhang, M. Li, and L. An, “A miniature stereo vision machine (MSVM-III) for dense disparity mapping,” in Proc. 17th Int. Conf. Pattern Recognit., Cambridge, U.K., vol. 1, Aug. 2004, pp. 728–731.
[11] W. J. Dally and S. Lacy, “VLSI architecture: Past, present, and future,” in Proc. 20th Anniv. Conf. Adv. Res. VLSI, Atlanta, GA, Mar. 1999, pp. 232–241.
[12] J. Diaz, E. Ros, F. Pelayo, E. M. Ortigosa, and S. Mota, “FPGA-based real-time optical-flow system,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 2, pp. 274–279, Feb. 2006.
[13] M. Z. Brown, D. Burschka, and G. D. Hager, “Advances in computational stereo,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 8, pp. 993–1008, Aug. 2003.
[14] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” Int. J. Comput. Vision, vol. 47, pp. 7–42, 2002.
[15] L. D. Stefano, “A PC-based real-time stereo vision system,” Mach. Graphics Vision Int. J., vol. 13, no. 3, pp. 197–220, Jan. 2004.
[16] M. Sonka, V. Hlavac, and R. Boyle, “3-D vision, geometry,” in Image Processing, Analysis, and Machine Vision, 2nd ed., Pacific Grove, CA: PWS Publishing, ch. 11, 1999.
[17] R. I. Hartley and A. Zisserman, “Epipolar geometry and the fundamental matrix,” in Multiple View Geometry. Cambridge, U.K.: Cambridge University Press, ch. 9, 2000.
[18] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach, 1st ed., Englewood Cliffs, NJ: Prentice Hall, 2002.
[19] R. Zabih and J. Woodfill, “Non-parametric local transforms for computing visual correspondence,” in Proc. 3rd Eur. Conf. Comput. Vision, Stockholm, Sweden, May 1994, pp. 151–158.
[20] M. H. Lin and C. Tomasi, “Surfaces with occlusions from layered stereo,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 8, pp. 1073–1078, Aug. 2004.
[21] G. Egnal and R. P. Wildes, “Detecting binocular half-occlusions: Empirical comparisons of five approaches,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 8, pp. 1127–1133, Aug. 2002.
[22] K. Mühlmann, D. Maier, J. Hesser, and R. Männer, “Calculating dense disparity maps from color stereo images, an efficient implementation,” Int. J. Comput. Vision, vol. 47, nos. 1–3, pp. 79–88, Apr. 2002.
[23] R. Yang, M. Pollefeys, and S. Li, “Improved real-time stereo on commodity graphics hardware,” in Proc. Conf. Comput. Vision Pattern Recognit., Jun. 2004, pp. 36–43.
[24] C. L. Zitnick and T. Kanade, “A cooperative algorithm for stereo matching and occlusion detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 7, pp. 675–684, Jul. 2000.
[25] Middlebury stereo vision page [Online]. Available: http://vision.middlebury.edu/stereo/
[26] D. Murray and J. Little, “Using real-time stereo vision for mobile robot navigation,” Auton. Robots, vol. 8, no. 2, pp. 161–171, 2000.
[27] K. Mühlmann, D. Maier, J. Hesser, and R. Männer, “Calculating dense disparity maps from color stereo images, an efficient implementation,” Int. J. Comput. Vision, vol. 47, pp. 79–88, 2002.

Seunghun Jin received the B.S. and M.S. degrees in electrical and computer engineering from Sungkyunkwan University, Suwon, Korea, in 2005 and 2006, respectively. He is currently working toward the Ph.D. degree at Sungkyunkwan University.
His research interests include image and speech signal processing, embedded systems, and real-time applications.

Junguk Cho received the B.S., M.S., and Ph.D. degrees from the School of Information and Communication Engineering, Sungkyunkwan University, Suwon, Korea, in 2001, 2003, and 2006, respectively.
From 2006 to 2007, he was a Research Instructor in the School of Information and Communication Engineering, Sungkyunkwan University. In 2008, he was with the Department of Computer Science and Engineering, University of California, San Diego, as a Postdoctoral Scholar. His research interests include motion control, image processing, and embedded systems.

Xuan Dai Pham received the B.S. and M.S. degrees in computer science from the University of Hochiminh City, Hochiminh City, Vietnam, in 1994 and 2001, respectively, and the Ph.D. degree in electrical and computer engineering from Sungkyunkwan University, Suwon, Republic of Korea, in 2008.
In 2008, he was with the Information Technology Faculty, Saigon Institute of Technology, Hochiminh City, Vietnam, as a Lecturer and a Researcher. His research interests include image processing and computer vision.

Kyoung Mu Lee received the B.S. and M.S. degrees in control and instrumentation engineering from Seoul National University, Seoul, Korea, in 1984 and 1986, respectively, and the Ph.D. degree in electrical engineering from the University of Southern California, Los Angeles, in 1993.
From 1994 to 1995, he was with Samsung Electronics Company Limited, Korea, as a Senior Researcher. Since September 2003, he has been with the Department of Electrical and Computer Engineering, Seoul National University, as a Professor. His primary research interests include computer vision, image understanding, pattern recognition, robot vision, and multimedia applications.

Sung-Kee Park received the B.S. and M.S. degrees in mechanical engineering from Seoul National University, Seoul, Korea, in 1987 and 1989, respectively, and the Ph.D. degree from the Department of Automation and Design Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 2000.
Since 2000, he has been with the Center for Intelligent Robotics, Korea Institute of Science and Technology, Seoul, Korea, as a Research Scientist. His current research interests include computer vision, robot vision, and intelligent robots.

Munsang Kim received the B.S. and M.S. degrees in mechanical engineering from Seoul National University, Seoul, Korea, in 1980 and 1982, respectively, and the Ph.D. degree in engineering robotics from the Technical University of Berlin, Berlin, Germany, in 1987.
Since 1987, he has been with the Center for Intelligent Robotics, Korea Institute of Science and Technology, Seoul, Korea, as a Research Scientist. He has led the Advanced Robotics Research Center since 2000, and became the Director of the Intelligent Robotics Development Program, Frontier 21 Program, in 2003. His current research interests are the design and control of novel mobile manipulation systems, haptic device design and control, and sensor applications to intelligent robots.


Jae Wook Jeon (S’82–M’84) received the B.S. and M.S. degrees in electronics engineering from Seoul National University, Seoul, Korea, in 1984 and 1986, respectively, and the Ph.D. degree in electrical engineering from Purdue University, West Lafayette, IN, in 1990.
From 1990 to 1994, he was a Senior Researcher at Samsung Electronics, Suwon, Korea. In 1994, he joined the School of Electrical and Computer Engineering, Sungkyunkwan University, Suwon, Korea, as an Assistant Professor, where he is currently a Professor. His research interests include robotics, embedded systems, and factory automation.
