
On Building an Accurate Stereo Matching System on Graphics Hardware

Xing Mei 1,2, Xun Sun 1, Mingcai Zhou 1, Shaohui Jiao 1, Haitao Wang 1, Xiaopeng Zhang 2
1 Samsung Advanced Institute of Technology, China Lab
2 LIAMA-NLPR, Institute of Automation, Chinese Academy of Sciences
{xing.mei,xunshine.sun,mingcai.zhou,sh.jiao,ht.wang}@samsung.com, xpzhang@nlpr.ia.ac.cn

Abstract

This paper presents a GPU-based stereo matching system with good performance in both accuracy and speed. The matching cost volume is initialized with an AD-Census measure, aggregated in dynamic cross-based regions, and updated in a scanline optimization framework to produce the disparity results. Various errors in the disparity results are effectively handled in a multi-step refinement process. Each stage of the system is designed with parallelism considerations such that the computations can be accelerated with CUDA implementations. Experimental results demonstrate the accuracy and the efficiency of the system: currently it is the top performer in the Middlebury benchmark, and the results are achieved on GPU within 0.1 seconds. We also provide extra examples on stereo video sequences and discuss the limitations of the system.

1. Introduction

Stereo matching is one of the most extensively studied problems in computer vision [11]. Two major concerns in stereo matching algorithm design are the matching accuracy and the processing efficiency. Although many algorithms are introduced every year, the two concerns tend to be contradictory in the reported results: accurate stereo methods are usually time consuming [6, 17, 20], while GPU-based methods achieve high processing speed with relatively low disparity precision [10, 18, 24]. To the best of our knowledge, most top 10 Middlebury algorithms require at least 10 seconds to process a 384 × 288 image pair, and the only two GPU-based methods in the top 20 are CostFilter [9] and PlaneFitBP [19], both working with near real-time performance.

The reason behind this contradiction is straightforward: some key techniques employed by accurate stereo algorithms are not suitable for parallel GPU implementation. A careful analysis of the leading Middlebury algorithms [6, 17, 20] shows that these algorithms share several common techniques in the matching process: they use large support windows for robust cost aggregation [5, 16, 21]; they formulate the disparity computation step as an energy minimization problem and solve it with slow-converging optimizers [14]; they extensively use segmented image regions as matching units [17], surface constraints [6, 20] or post-processing patches [2]. These techniques significantly improve the matching quality at the cost of considerable computation. Directly porting these techniques to GPU or other multi-core platforms, however, is tricky and cumbersome [4, 7, 18, 19]: large aggregation windows require extensive iterations over each pixel; some optimization, segmentation and post-processing methods require complex data structures and sequential processing. As a result, simple correlation-based techniques are more popular for GPU and embedded stereo systems. Designing a stereo matching system with a good balance between accuracy and efficiency remains a challenging problem.

In this paper, we aim to meet this challenge by providing an accurate stereo matching system with near real-time performance. Currently (August 2011), our system is the top performer in the Middlebury benchmark. Briefly, we integrate several techniques into an effective stereo framework. These techniques guarantee high matching quality without involving expensive aggregation or segmented regions. Furthermore, they show moderate parallelism such that the entire system can be mapped on GPU for computation acceleration. The key techniques of our system include:

- An AD-Census cost measure which effectively combines the absolute differences (AD) measure and the census transform. This measure provides more accurate matching results than common individual measures with robust aggregation methods. A similar measure has been adopted in a recent stereo algorithm [13].

- Improved cross-based regions for efficient cost aggregation. Support regions based on cross skeletons were first proposed by Zhang et al. [23], and allow fast aggregation with middle-ranking disparity results. We enhance this technique with a more accurate cross construction and cost aggregation strategy.

- A scanline optimizer based on Hirschmuller's semi-global matcher [2] with reduced path directions.

- A systematic refinement process which handles various disparity errors with iterative region voting, interpolation, depth discontinuity adjustment and sub-pixel enhancement. This multi-step process proves to be very effective for improving the disparity results.

- Efficient system implementation on GPU with CUDA.

2. Algorithm

Following Scharstein and Szeliski's taxonomy [11], our system consists of four steps: cost initialization, cost aggregation, disparity computation and refinement. We present a detailed description of these individual steps.

2.1. AD-Census Cost Initialization

This step computes the initial matching cost volume. Since the computation can be performed concurrently at each pixel and each disparity level, this step is inherently parallel. Our major concern is to develop a cost measure with high matching quality. Common cost measures include absolute differences (AD), Birchfield and Tomasi's sampling-insensitive measure (BT), gradient-based measures and non-parametric transforms such as rank and census [22]. In a recent evaluation by Hirschmuller and Scharstein [3], census shows the best overall results in local and global stereo matching methods. Although the idea of combining cost measures for improved accuracy seems straightforward, relatively little work has explored this topic. Klaus et al. [6] proposed to linearly combine SAD and a gradient-based measure for cost computation. Their disparity results are impressive, but the benefits of the combination are not clearly elaborated.

Census encodes local image structures with relative orderings of the pixel intensities rather than the intensity values themselves, and therefore tolerates outliers due to radiometric changes and image noise. However, this asset could also introduce matching ambiguities in image regions with repetitive or similar local structures. To handle this problem, more detailed information should be incorporated in the measure. For image regions with similar local structures, the color (or intensity) information might help alleviate the matching ambiguities; for regions with similar color distributions, the census transform over a window is more robust than pixel-based intensity difference. This observation inspires a combined measure.

Given a pixel p = (x, y) in the left image and a disparity level d, two individual cost values C_census(p, d) and C_AD(p, d) are first computed. For C_census, we use a 9 × 7 window to encode each pixel's local structure in a 64-bit string. C_census(p, d) is defined as the Hamming distance of the two bit strings that stand for pixel p and its correspondence pd = (x − d, y) in the right image [22]. C_AD(p, d) is defined as the average intensity difference of p and pd in RGB channels:

    C_AD(p, d) = (1/3) Σ_{i=R,G,B} |I_i^left(p) − I_i^right(pd)|    (1)

The AD-Census cost value C(p, d) is then computed as follows:

    C(p, d) = ρ(C_census(p, d), λ_census) + ρ(C_AD(p, d), λ_AD)    (2)

where ρ(c, λ) is a robust function on variable c:

    ρ(c, λ) = 1 − exp(−c / λ)    (3)

The purpose of the function is twofold: first, it maps different cost measures to the range [0, 1], such that equation (2) won't be severely biased by one of the measures; second, it allows easy control on the influence of the outliers with parameter λ.

To verify the effect of the combination, some close-up disparity results on the Middlebury data sets with AD, Census and AD-Census are presented in Figure 1. Cross-based aggregation is employed. Census produces wrong matches in regions with repetitive local structures, while pixel-based AD cannot handle large textureless regions well. The combined AD-Census measure successfully reduces the errors caused by individual measures. For quantitative comparison, AD-Census reduces Census's non-occlusion error percentage by 1.96% (Tsukuba), 0.4% (Venus), 1.36% (Teddy) and 1.52% (Cones) respectively. And this improvement comes at low additional computation cost.

Figure 1. Some close-up disparity results on the Tsukuba and Teddy image pairs, computed with the AD, Census and AD-Census cost measures and cross-based aggregation. The AD-Census measure produces proper disparity results for both repetitive structures and textureless regions.
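The cost initialization above can be restated in a few lines of scalar Python. This is an illustrative sketch of equations (1)–(3), not the paper's CUDA kernel: images are nested lists, the census window defaults to the paper's 9 × 7, and λ_AD = 10, λ_census = 30 follow the parameter settings in Table 1.

```python
from math import exp

def census_bits(img, x, y, half_w=4, half_h=3):
    """Census transform: one bit per (neighbor < center) test.
    Defaults give the paper's 9 x 7 window."""
    center = img[y][x]
    bits = []
    for dy in range(-half_h, half_h + 1):
        for dx in range(-half_w, half_w + 1):
            if dx == 0 and dy == 0:
                continue
            bits.append(1 if img[y + dy][x + dx] < center else 0)
    return bits

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def rho(cost, lam):
    """Eq. (3): maps any non-negative cost into [0, 1)."""
    return 1.0 - exp(-cost / lam)

def ad_census_cost(left_rgb, right_rgb, left_gray, right_gray,
                   x, y, d, lam_ad=10.0, lam_census=30.0,
                   half_w=4, half_h=3):
    """Eq. (2): rho(C_census) + rho(C_AD) for one pixel and disparity."""
    # Eq. (1): mean absolute difference over the RGB channels.
    pl, pr = left_rgb[y][x], right_rgb[y][x - d]
    c_ad = sum(abs(a - b) for a, b in zip(pl, pr)) / 3.0
    # Hamming distance between the two census bit strings.
    c_census = hamming(census_bits(left_gray, x, y, half_w, half_h),
                       census_bits(right_gray, x - d, y, half_w, half_h))
    return rho(c_census, lam_census) + rho(c_ad, lam_ad)
```

Because each ρ term lies in [0, 1), the combined cost stays in [0, 2) regardless of how large either raw measure gets, which is what keeps one measure from dominating the sum.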
2.2. Cross-based Cost Aggregation

This step aggregates each pixel's matching cost over a support region to reduce the matching ambiguities and noise in the initial cost volume. A simple but effective assumption for aggregation is that neighboring pixels with similar colors should have similar disparities. This assumption has been adopted by recent aggregation methods, such as segment support [16], adaptive weight [21] and geodesic weight [5]. As stated in the introduction, these aggregation methods require segmentation operations or expensive iterations over each pixel, which are prohibitive for efficient GPU implementation. Although simplified adaptive weight techniques with 1D aggregation [13, 18] and color averaging [4, 19] have been proposed for GPU systems, the aggregation accuracy usually degenerates. Recently, Rhemann et al. [9] formulated the aggregation step as a cost filtering problem. By smoothing each cost slice with a guided filter [1], good disparity results can be achieved.

We instead focus on the cross-based aggregation method proposed recently by Zhang et al. [23]. We show that by improving the support region construction and the aggregation strategy, this method can produce aggregated results comparable to the adaptive weight method with much less computation time. Another advantage over the adaptive weight method is that an explicit support region is constructed for each pixel, which can be used in later post-processing steps.

Figure 2. Cross-based aggregation: In the first step, an upright cross is constructed for each pixel. The support region of pixel p is modelled by merging the horizontal arms of the pixels (q for example) lying on the vertical arms of pixel p. In the second step, the cost in the support region is aggregated within two passes along the horizontal and vertical directions.

Cross-based aggregation proceeds by a two-step process, as shown in Figure 2. In the first step (Figure 2(a)), an upright cross with four arms is constructed for each pixel. Given a pixel p, its left arm stops when it finds an endpoint pixel p_l that violates one of the two following rules:

1. D_c(p_l, p) < τ, where D_c(p_l, p) is the color difference between p_l and p, and τ is a preset threshold value. The color difference is defined as D_c(p_l, p) = max_{i=R,G,B} |I_i(p_l) − I_i(p)|.

2. D_s(p_l, p) < L, where D_s(p_l, p) is the spatial distance between p_l and p, and L is a preset maximum length (measured in pixels). The spatial distance is defined as D_s(p_l, p) = |p_l − p|.

The two rules pose limitations on color similarity and arm length with the parameters τ and L. The right, up and bottom arms of p are built in a similar way. After the cross construction step, the support region for pixel p is modelled by merging the horizontal arms of all the pixels lying on p's vertical arms (q for example). In the second step (Figure 2(b)), the aggregated costs over all pixels are computed within two passes: the first pass sums up the matching costs horizontally and stores the intermediate results; the second pass then aggregates the intermediate results vertically to get the final costs. Both passes can be efficiently computed with 1D integral images. To get a stable cost volume, the aggregation step usually runs 2–4 iterations, which can be seen as an anisotropic diffusion process. More details about the method can be found in [23].

The accuracy of cross-based aggregation is closely related to the parameters τ and L, since they control the shape of the support regions with the construction rules. Large textureless regions may require large L and τ values to include enough intensity variation, but simply increasing these parameters for all the pixels would introduce more errors in dark regions or at depth discontinuities. We therefore propose to construct each pixel's cross with the following enhanced rules (we still use pixel p's left arm and the endpoint pixel p_l as an example):

1. D_c(p_l, p) < τ_1 and D_c(p_l, p_l + (1, 0)) < τ_1

2. D_s(p_l, p) < L_1

3. D_c(p_l, p) < τ_2, if L_2 < D_s(p_l, p) < L_1.

Rule 1 restricts not only the color difference between p_l and p, but also the color difference between p_l and its predecessor p_l + (1, 0) on the same arm, such that the arm won't run across the edges in the image. Rules 2 and 3 allow more flexible control on the arm length. We use a large L_1 value to include enough pixels for textureless regions. But when the arm length exceeds a preset value L_2 (L_2 < L_1), a much stricter threshold value τ_2 (τ_2 < τ_1) is used for D_c(p_l, p) to make sure that the arm only extends in regions with very similar color patterns.
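The enhanced left-arm rules can be sketched as a scalar loop. This is an illustrative re-statement, not the paper's GPU code: the image is a nested list of RGB tuples, and the default thresholds (τ_1 = 20, τ_2 = 6, L_1 = 34, L_2 = 17) follow Table 1.

```python
def color_diff(img, p, q):
    """D_c: maximum absolute difference over the RGB channels."""
    (px, py), (qx, qy) = p, q
    return max(abs(a - b) for a, b in zip(img[py][px], img[qy][qx]))

def left_arm(img, x, y, tau1=20, tau2=6, l1=34, l2=17):
    """Length of pixel p's left arm under the three enhanced rules."""
    arm = 0
    for step in range(1, l1):                  # rule 2: D_s(p_l, p) < L_1
        xl = x - step
        if xl < 0:                             # stop at the image border
            break
        p, pl, prev = (x, y), (xl, y), (xl + 1, y)
        if color_diff(img, pl, p) >= tau1:     # rule 1: similar to p
            break
        if color_diff(img, pl, prev) >= tau1:  # rule 1: no edge crossing
            break
        if step > l2 and color_diff(img, pl, p) >= tau2:
            break                              # rule 3: stricter tau_2 far from p
        arm = step
    return arm
```

On a mildly textured row the arm grows to L_2 and is then cut off by the stricter τ_2 test, which is exactly the intended behaviour: long arms are only allowed in regions with very similar color patterns.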
For the cost aggregation step, we also propose a different strategy. We still run this step for 4 iterations to get stable cost values. For iterations 1 and 3, we follow the original method: the costs are first aggregated horizontally and then vertically. But for iterations 2 and 4, we switch the aggregation directions: the costs are first aggregated vertically and then horizontally. For each pixel, this new aggregation order leads to a cross-based support region different from the one in the original method. By alternating the aggregation directions, both support regions are used in the iterative process. We find that such an aggregation strategy can significantly reduce the errors at depth discontinuities.

The Tsukuba disparity results computed by the original cross-based aggregation method and our improved method are presented in Figure 3, which shows that the enhanced cross construction rules and aggregation strategy can produce more accurate results in large textureless regions and near depth discontinuities.

Figure 4. The average disparity error percentages in various regions for adaptive weight, the original cross-based aggregation method and our enhanced method.
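The two-pass aggregation over a single cost slice can be sketched with 1D prefix sums (the "1D integral images" above). This toy version assumes the arm lengths have already been computed by the cross construction step and are clipped to the image borders; it performs one horizontal-then-vertical pass, and the paper's alternating strategy on iterations 2 and 4 corresponds to running the analogous function with the two passes exchanged.

```python
def prefix_sums(values):
    """1D integral image: sums[i] = values[0] + ... + values[i-1]."""
    sums = [0.0]
    for v in values:
        sums.append(sums[-1] + v)
    return sums

def aggregate_slice(cost, h_arms, v_arms):
    """One horizontal-then-vertical aggregation pass over one disparity
    slice. h_arms[y][x] = (left, right) and v_arms[y][x] = (up, down)
    hold the arm lengths of each pixel's cross."""
    h, w = len(cost), len(cost[0])
    # Pass 1: sum each pixel's costs over its horizontal arms.
    horiz = [[0.0] * w for _ in range(h)]
    for y in range(h):
        sums = prefix_sums(cost[y])
        for x in range(w):
            left, right = h_arms[y][x]
            horiz[y][x] = sums[x + right + 1] - sums[x - left]
    # Pass 2: sum the intermediate results over the vertical arms.
    out = [[0.0] * w for _ in range(h)]
    for x in range(w):
        sums = prefix_sums([horiz[y][x] for y in range(h)])
        for y in range(h):
            up, down = v_arms[y][x]
            out[y][x] = sums[y + down + 1] - sums[y - up]
    return out
```

With the prefix sums precomputed, each pixel's aggregated cost is obtained with two subtractions per pass, independent of the arm lengths, which is what makes large support regions affordable.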
The WTA disparity results with three aggregation methods (adaptive weight, the original cross-based aggregation method and our enhanced method) are evaluated. For adaptive weight, the parameters follow the settings in [21]. For the original cross-based method, τ = 17, L = 20 and 4 iterations are used. The average error percentages on the four data sets (in non-occlusion, discontinuity and all regions) are given in Figure 4. Our enhanced method produces the most accurate results in all kinds of regions, especially around depth discontinuities. Our implementation of the adaptive weight method usually takes more than 1 minute on CPU to produce the aggregated volume, while our method requires only a few seconds.

Figure 3. Comparison of the original cross-based aggregation method and our improved method on the Tsukuba image pair. Our aggregation method can better handle large textureless regions and depth discontinuities.

2.3. Scanline Optimization

This step takes in the aggregated matching cost volume (denoted as C_1) and computes the intermediate disparity results. To further alleviate the matching ambiguities, an optimizer with smoothness constraints and moderate parallelism should be adopted. We employ a multi-direction scanline optimizer based on Hirschmuller's semi-global matching method [2].

Four scanline optimization processes are performed independently: 2 along horizontal directions and 2 along vertical directions. Given a scanline direction r, the path cost C_r(p, d) at pixel p and disparity d is updated as follows:

    C_r(p, d) = C_1(p, d) + min( C_r(p − r, d),
                                 C_r(p − r, d ± 1) + P_1,
                                 min_k C_r(p − r, k) + P_2 ) − min_k C_r(p − r, k)    (4)

where p − r is the previous pixel along the same direction, and P_1, P_2 (P_1 ≤ P_2) are two parameters for penalizing the disparity changes between neighboring pixels. In practice, P_1, P_2 are set symmetrically according to the color differences D_1 = D_c(p, p − r) in the left image and D_2 = D_c(pd, pd − r) in the right image [8]:

1. P_1 = Π_1, P_2 = Π_2, if D_1 < τ_SO, D_2 < τ_SO.

2. P_1 = Π_1 / 4, P_2 = Π_2 / 4, if D_1 < τ_SO, D_2 > τ_SO.

3. P_1 = Π_1 / 4, P_2 = Π_2 / 4, if D_1 > τ_SO, D_2 < τ_SO.

4. P_1 = Π_1 / 10, P_2 = Π_2 / 10, if D_1 > τ_SO, D_2 > τ_SO.

where Π_1, Π_2 are constants, and τ_SO is a threshold value for color difference. The final cost C_2(p, d) for pixel p and disparity d is obtained by averaging the path costs from all four directions:

    C_2(p, d) = (1/4) Σ_r C_r(p, d)    (5)

The disparity with the minimum C_2 value is selected as pixel p's intermediate result.
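The recurrence in equation (4) along a single left-to-right scanline can be sketched as a small dynamic program. This is an illustrative sketch only: it uses constant P_1/P_2 penalties, whereas the paper adapts them to the color differences D_1, D_2 as listed above.

```python
def scanline_costs(c1, p1, p2):
    """Path costs C_r along one left-to-right scanline (Eq. 4).
    c1[x][d] holds the aggregated costs C_1; p1 <= p2 penalize small
    and large disparity jumps between neighboring pixels."""
    n, ndisp = len(c1), len(c1[0])
    cr = [row[:] for row in c1]          # at the first pixel, C_r = C_1
    for x in range(1, n):
        prev = cr[x - 1]
        prev_min = min(prev)
        for d in range(ndisp):
            candidates = [prev[d]]                    # keep the same disparity
            if d > 0:
                candidates.append(prev[d - 1] + p1)   # jump by one level
            if d < ndisp - 1:
                candidates.append(prev[d + 1] + p1)
            candidates.append(prev_min + p2)          # larger jump
            # Subtracting prev_min keeps the running cost bounded.
            cr[x][d] = c1[x][d] + min(candidates) - prev_min
    return cr
```

The four directional passes are independent of each other, which is why the step parallelizes across scanlines and disparity levels on the GPU; the final cost C_2 in equation (5) simply averages the four results.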
2.4. Multi-step Disparity Refinement

The disparity results of both images (denoted as D_L and D_R) computed by the previous three steps contain outliers in occlusion regions and at depth discontinuities. After detecting these outliers, the simplest refinement method is to fill them with the nearest reliable disparities [11], which is only useful for small occlusion regions. We instead handle the disparity errors systematically in a multi-step process. Each step tries to remove the errors caused by various factors.

Outlier Detection: The outliers in D_L are first detected with a left-right consistency check: pixel p is an outlier if D_L(p) = D_R(p − (D_L(p), 0)) doesn't hold. Outliers are further classified into occlusion and mismatch points, since they require different interpolation strategies. We follow the method proposed by Hirschmuller [2]: for outlier p at disparity D_L(p), the intersection of its epipolar line and D_R is checked. If no intersection is detected, p is labelled as occlusion, otherwise mismatch.

Iterative Region Voting: The detected outliers should be filled with reliable neighboring disparities. Most accurate stereo algorithms employ segmented regions for outlier handling [2, 20], which are not suitable for GPU implementation. We process these outliers with the constructed cross-based regions and a robust voting scheme.

For an outlier pixel p, all the reliable disparities in its cross-based support region are collected to build a histogram H_p with d_max + 1 bins. The disparity with the highest bin value (most votes) is denoted as d*_p, and the total number of the reliable pixels is denoted as S_p = Σ_{d=0}^{d_max} H_p(d). p's disparity is then updated with d*_p if enough reliable pixels and votes are found in the support region:

    S_p > τ_S,  H_p(d*_p) / S_p > τ_H    (6)

where τ_S, τ_H are two threshold values. To process as many outliers as possible, the voting process runs for 5 iterations. The filled outliers are marked as reliable pixels and used in the next iteration, such that valid disparity information can gradually propagate into occlusion regions.

Proper Interpolation: The remaining outliers are filled with an interpolation strategy that treats occlusion and mismatch points differently. For outlier p, we find the nearest reliable pixels in 16 different directions. If p is an occlusion point, the pixel with the lowest disparity value is selected for interpolation, since p most likely comes from the background; otherwise the pixel with the most similar color is selected for interpolation. With region voting and interpolation, most outliers are effectively removed from the disparity results, as shown in Figure 5.

Figure 5. The disparity error maps for the Teddy image pair. The errors are marked in gray (occlusion) and black (non-occlusion). The disparity errors are significantly reduced in the outlier handling process.

Depth Discontinuity Adjustment: In this step, the disparities around the depth discontinuities are further refined with neighboring pixel information. We first detect all the edges in the disparity image. For each pixel p on a disparity edge, two pixels p_1, p_2 from both sides of the edge are collected. D_L(p) is replaced by D_L(p_1) or D_L(p_2) if one of the two pixels has a smaller matching cost than C_2(p, D_L(p)). This simple method helps to reduce the small errors around discontinuities, as shown by the error maps in Figure 6.

Figure 6. The errors around depth discontinuities are reduced after the adjustment step.

Sub-pixel Enhancement: Finally, a sub-pixel enhancement process based on quadratic polynomial interpolation is performed to reduce the errors caused by discrete disparity levels [20]. For pixel p, its interpolated disparity d* is computed as follows:

    d* = d − ( C_2(p, d+) − C_2(p, d−) ) / ( 2 (C_2(p, d+) + C_2(p, d−) − 2 C_2(p, d)) )    (7)

where d = D_L(p), d+ = d + 1, d− = d − 1. The final disparity results are obtained by smoothing the interpolated disparity results with a 3 × 3 median filter.

To verify the effectiveness of the refinement process, the average error percentages in various regions after performing each refinement step are presented in Figure 7. The four refinement steps successfully reduce the error percentage in all regions by 3.8%, but their contributions are distinct for different regions: for non-occluded regions, voting and sub-pixel enhancement are most effective for handling the mismatch outliers; for discontinuity regions, the errors are significantly reduced by voting, discontinuity adjustment and sub-pixel enhancement; most outliers in all regions are removed with voting and interpolation, and small errors due to discontinuities and quantization are reduced by adjustment and sub-pixel enhancement. A systematic integration of these steps guarantees a strong post-processing method.
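The voting test of equation (6) for a single outlier pixel can be sketched as follows. This is a toy re-statement, not the GPU code: the support region is assumed given as a list of pixel coordinates from the cross construction step, the reliability mask comes from the left-right consistency check, and the defaults τ_S = 20, τ_H = 0.4 follow Table 1.

```python
def vote_disparity(disparities, reliable, region, d_max,
                   tau_s=20, tau_h=0.4):
    """Region voting for one outlier pixel (Eq. 6): histogram the
    reliable disparities inside its cross-based support region and
    accept the most-voted disparity only if the support is strong.
    Returns None when the vote is rejected."""
    hist = [0] * (d_max + 1)
    for (x, y) in region:
        if reliable[y][x]:
            hist[disparities[y][x]] += 1
    s_p = sum(hist)                  # S_p: number of reliable votes
    if s_p == 0:
        return None
    d_star = max(range(d_max + 1), key=lambda d: hist[d])
    if s_p > tau_s and hist[d_star] / s_p > tau_h:
        return d_star
    return None
```

In the full process this test is applied to every outlier for 5 iterations, with accepted pixels marked reliable for the next round, so valid disparities gradually propagate into occluded areas.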
3. CUDA Implementation

Compute Unified Device Architecture (CUDA) is a programming interface for parallel computation tasks on NVIDIA graphics hardware. The computation task is coded into a kernel function, which is performed concurrently on data elements by multiple threads. The allocation of the threads is controlled with two hierarchical concepts: grid and block. A kernel launch creates a grid with multiple blocks, and each block consists of multiple threads. The performance of a CUDA implementation is closely related to thread allocation and memory accesses, which need careful tuning for various computation tasks and hardware platforms. Given image resolution W × H and disparity range D, we briefly describe the implementation issues of our algorithm.

Cost Initialization: This step is parallelized with W × H × D threads. The threads are organized into a 2D grid and the block size is set to 32 × 32. Each thread takes care of computing a cost value for a pixel at a given disparity. For the census transform, a support window is required for each pixel, which requires loading more data into the shared memory for fast access.

Cost Aggregation: A grid with W × H threads is created for both steps of the aggregation process. For cross construction, we set the block size to W or H, such that each block can efficiently handle a scanline. For cost aggregation, we follow the method proposed by Zhang et al. [24], which works similarly to the first step. Each thread sums up a pixel's cost values horizontally and vertically in two passes. Data reuse with shared memory is considered in both steps.

Scanline Optimization: This step is different from the previous steps, because the process is sequential in the scanline direction and parallel in the orthogonal direction. A grid with W × D or H × D threads is created according to the scanline direction. D threads are allocated for each scanline, such that the path costs on all disparity levels can be computed concurrently. Synchronization between the threads is needed for finding the minimum cost of the previous pixel on the same path.

Disparity Refinement: Each step of the refinement process works on the intermediate disparity images, which can be efficiently processed with W × H threads.

4. Experimental Results

We test our system with the Middlebury benchmark [12]. The test platform is a PC with a Core2Duo 2.20GHz CPU and an NVIDIA GeForce GTX 480 graphics card. The parameters are given in Table 1 and are kept constant for all the data sets.

λ_AD = 10, λ_census = 30, L_1 = 34, L_2 = 17, τ_1 = 20, τ_2 = 6
Π_1 = 1.0, Π_2 = 3.0, τ_SO = 15, τ_S = 20, τ_H = 0.4

Table 1. Parameter settings for the Middlebury experiments.

The disparity results are presented in Figure 8. Our system ranks first in the Middlebury evaluation, as shown in Table 2. Our algorithm performs well on all the data sets, and gives the best results on the Venus image pair with minimum errors both in non-occluded regions and near depth discontinuities. Compared to algorithms such as CoopRegion [17], the results on the Tsukuba image pair are not competitive. The Tsukuba image pair contains some very dark and noisy regions near the lamp and the desk, which lead to incorrect cross-based support regions for aggregation and refinement.

We run the algorithm both on CPU and on graphics hardware. For the four data sets (Tsukuba, Venus, Teddy and Cones), the CPU implementation requires 2.5 seconds, 4.5 seconds, 15 seconds and 15 seconds respectively, while the GPU implementation requires only 0.016 seconds, 0.032 seconds, 0.095 seconds and 0.094 seconds respectively. The GPU-friendly system design brings an impressive 140× speedup in the processing speed. The average proportions of the GPU running time for the four computation steps are 1%, 70%, 28% and 1% respectively. The iterative cost aggregation step and the scanline optimization process dominate the running time.

Finally, we test our system on two stereo video sequences: a book arrival scene from the HHI database (512 × 384, 60 disparity levels), and an Ilkay scene from the Microsoft i2i database (320 × 240, 50 disparity levels). To test the generalization ability of the system, we use the same set of parameters as for the Middlebury datasets, and no temporal coherence information is employed in the computation process. The snapshots for the two examples are presented in Figure 9, and a video demo that runs at about 10 FPS is available at http://xing-mei.net/resource/video/adcensus.avi. Our system performs reasonably well on these examples, but the results are not as convincing as on the Middlebury datasets: artifacts are visible around depth borders and occlusion regions.

We briefly discuss the limitations of the current system with the video examples. The disparity errors come from several aspects: first, the support regions defined by the cross skeleton rely heavily on color and connectivity constraints. For practical scenes the cross construction process can be easily corrupted by dark regions and image noise. Small regions without enough support area can be produced, which brings significant errors to later computation steps such as cost computation and region voting. Bilateral filtering might be used as a pre-process to reduce the noise while preserving the image edges [1, 15]. Second, the well-designed multi-stage mechanism is a double-edged sword.
It helps us to get accurate results and remove the errors step by step in a systematic way, but it also brings a large set of parameters. By carefully tuning individual parameters, the disparity quality can be improved, but such a scheme is usually laborious and impractical for various real-world applications. A possible solution is to analyze the robustness of the parameters with ground truth data and adaptively set the unstable parameters for different visual contents. Automatic parameter estimation within an iterative framework [25] might also be used to avoid the tricky parameter tuning process.

5. Conclusions

This paper has presented a near real-time stereo system with accurate disparity results. Our system is based on several key techniques: the AD-Census cost measure, cross-based support regions, scanline optimization and a systematic refinement process. These techniques significantly improve the disparity quality without sacrificing performance and parallelism, which makes them suitable for GPU implementation. Although our system presents nice results for the Middlebury data sets, applying it in real-world applications is still a challenging task, as shown by the video examples. Real-world data usually contains significant image noise, rectification errors and illumination variation, which might cause serious problems for cost computation and support region construction. Robust parameter setting methods are also important to produce satisfactory results. We would like to explore these topics in the future.

Acknowledgment

The authors would like to thank Daniel Scharstein for the Middlebury test bed and personal communications.

References

[1] K. He, J. Sun, and X. Tang. Guided image filtering. In Proc. ECCV, 2010.
[2] H. Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE TPAMI, 30(2):328-341, 2008.
[3] H. Hirschmuller and D. Scharstein. Evaluation of stereo matching costs on images with radiometric differences. IEEE TPAMI, 31(9):1582-1599, 2009.
[4] A. Hosni, M. Bleyer, and M. Gelautz. Near real-time stereo with adaptive support weight approaches. In Proc. 3DPVT, 2010.
[5] A. Hosni, M. Bleyer, M. Gelautz, and C. Rheman. Local stereo matching using geodesic support weights. In Proc. ICIP, pages 2093-2096, 2009.
[6] A. Klaus, M. Sormann, and K. Karner. Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In Proc. ICPR, pages 15-18, 2006.
[7] J. Liu and J. Sun. Parallel graph-cuts by adaptive bottom-up merging. In Proc. CVPR, pages 2181-2188, 2010.
[8] S. Mattoccia, F. Tombari, and L. D. Stefano. Stereo vision enabling precise border localization within a scanline optimization framework. In Proc. ACCV, pages 517-527, 2007.
[9] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and M. Gelautz. Fast cost-volume filtering for visual correspondence and beyond. In Proc. CVPR, 2011.
[10] C. Richardt, D. Orr, I. Davies, A. Criminisi, and N. A. Dodgson. Real-time spatiotemporal stereo matching using the dual-cross-bilateral grid. In Proc. ECCV, 2010.
[11] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 47(1-3):7-42, 2002.
[12] D. Scharstein and R. Szeliski. Middlebury stereo evaluation - version 2, 2010. http://vision.middlebury.edu/stereo/eval/.
[13] X. Sun, X. Mei, S. Jiao, M. Zhou, and H. Wang. Stereo matching with reliable disparity propagation. In Proc. 3DIMPVT, pages 132-139, 2011.
[14] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE TPAMI, 30(6):1068-1080, 2008.
[15] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proc. ICCV, pages 839-846, 1998.
[16] F. Tombari, S. Mattoccia, and L. D. Stefano. Segmentation-based adaptive support for accurate stereo correspondence. In Proc. PSIVT, pages 427-438, 2007.
[17] Z. Wang and Z. Zheng. A region based stereo matching algorithm using cooperative optimization. In Proc. CVPR, pages 1-8, 2008.
[18] Y. Wei, C. Tsuhan, F. Franz, and C. H. James. High performance stereo vision designed for massively data parallel platforms. IEEE TCSVT, 2010.
[19] Q. Yang, C. Engels, and A. Akbarzadeh. Near real-time stereo for weakly-textured scenes. In Proc. BMVC, pages 80-87, 2008.
[20] Q. Yang, L. Wang, R. Yang, H. Stewenius, and D. Nister. Stereo matching with color-weighted correlation, hierarchical belief propagation and occlusion handling. IEEE TPAMI, 31(3):492-504, 2009.
[21] K.-J. Yoon and I.-S. Kweon. Adaptive support-weight approach for correspondence search. IEEE TPAMI, 28(4):650-656, 2006.
[22] R. Zabih and J. Woodfill. Non-parametric local transforms for computing visual correspondence. In Proc. ECCV, pages 151-158, 1994.
[23] K. Zhang, J. Lu, and G. Lafruit. Cross-based local stereo matching using orthogonal integral images. IEEE TCSVT, 19(7):1073-1079, 2009.
[24] K. Zhang, J. Lu, G. Lafruit, R. Lauwereins, and L. V. Gool. Real-time accurate stereo with bitwise fast voting on CUDA. In Proc. ICCV Workshops, 2009.
[25] L. Zhang and S. M. Seitz. Estimating optimal parameters for MRF stereo from a single image pair. IEEE TPAMI, 29(2):331-342, 2007.
Figure 7. The average error percentages in non-occlusion, discontinuity and all regions after performing each refinement step (scanline optimization, region voting, interpolation, discontinuity adjustment and sub-pixel enhancement).

Figure 8. Results on the Middlebury data sets. First row: disparity maps generated with our system. Second row: disparity error maps with threshold 1. Errors in unoccluded and occluded regions are marked in black and gray respectively.

Algorithm        Avg.   Tsukuba               Venus                 Teddy                 Cones
                 Rank   nonocc  all    disc   nonocc  all    disc   nonocc  all    disc   nonocc  all    disc
Our method       5.8    1.07    1.48   5.73   0.09    0.25   1.15   4.10    6.22   10.9   2.42    7.25   6.95
AdaptingBP [6]   7.2    1.11    1.37   5.79   0.10    0.21   1.44   4.22    7.06   11.8   2.48    7.92   7.32
CoopRegion [17]  7.2    0.87    1.16   4.61   0.11    0.21   1.54   5.16    8.31   13.0   2.79    7.18   8.01
DoubleBP [20]    9.7    0.88    1.29   4.76   0.13    0.45   1.87   3.53    8.30   9.63   2.90    8.78   7.79

Table 2. The rankings in the Middlebury benchmark. The error percentages in different regions for the four data sets are presented.

Figure 9. Snapshots on the book arrival and Ilkay stereo video sequences.