You are on page 1of 4

TEMPORAL STABILIZATION OF VIDEO OBJECT SEGMENTATION FOR 3D-TV

APPLICATIONS

Çiǧdem Eroğlu Erdem 1 , Fabian Ernst2 , Andre Redert 2 and Emile Hendriks 3

Momentum A. Ş., İstanbul, Turkey
1

Philips Research Laboratories, Eindhoven, The Netherlands
2
3
Faculty of Electrical Engineering, Delft University of Technology, The Netherlands
E-mail: cigdem@ieee.org, {fabian.ernst,andre.redert}@philips.com, E.A.Hendriks@ewi.tudelft.nl

ABSTRACT such that they will be less visible when the scene is rendered and
We present a method for improving the temporal stability of video viewed in stereo. The input to the proposed algorithm is a set of
object segmentation algorithms for 3D-TV applications. First, two temporally unstable object segmentation maps which is estimated
quantitative measures to evaluate temporal stability without ground- by any algorithm in the literature, for example by [3].
truth are presented. Then, a pseudo-3D curve evolution method,
which spatio-temporally stabilizes the estimated object segments
is introduced. Temporal stability is achieved by re-distributing ex-
isting object segmentation errors such that they will be less dis-
turbing when the scene is rendered and viewed in 3D. Our starting
point is the hypothesis that if making segmentation errors are in-
evitable, they should be made in a temporally consistent way for
3D TV applications. This hypothesis is supported by the exper-
iments, which show that there is significant improvement in seg-
mentation quality both in terms of the objective quantitative mea- (a) (b)
sures and in terms of the viewing comfort in subjective perceptual
tests. This shows that it is possible to increase the object segmen-
tation quality without increasing the actual segmentation accuracy.

1. INTRODUCTION

The task of building 3D models of a time-varying scene, using the
2D views recorded by uncalibrated cameras is an important but un- (c) (d)
solved task to provide content for the newly emerging 3D TV [1].
One approach to this problem is to segment the objects in the scene Fig. 1. (a), (b) First and last frames of “Flikken” sequence. (c)
and order their video object planes (VOPs) with respect to their in- The given temporally unstable video object planes for the “lady”
ferred relative depths. This approach gives a satisfactory sense of object (frames 8, 9, 10, 80, 81) from left to right. (d) Ground-truth
three dimensions when the scene is viewed in stereo. However, one VOPs for frames 8, 80 and 145.
of the most important requirements is the temporal stability of the
video object planes. The changes in video due to occlusions, cam- 2. MEASURES FOR TEMPORAL STABILITY
era motion, changing background and noise should not cause sud-
den changes (temporal instabilities) in the shape and color compo- Assuming that the color histogram of the object does not change
sition of the video object planes (see Fig.1(c)), as they cause very drastically from frame to frame, we can expect that a temporally
disturbing flickering effects when the scene is viewed in stereo in stable object segmentation exhibits small differences between the
3D TV applications. color histograms of the estimated video object planes (VOPs) [4].
Many object segmentation and tracking algorithms exist in the One shortcoming of the histogram measure is that it cannot distin-
literature [2]. These algorithms may loose temporal stability un- guish if a portion of the object is removed and replaced by another
der difficult conditions, e.g. when the colors of the object and the block of the same color belonging to the background. Therefore,
background are similar causing missing object boundaries or when we can also require that the shape of two successive video object
the motion can not be estimated with sufficient accuracy. In this planes should not differ drastically. Hence, histogram and shape
paper we try to answer the question: “If making object segmenta- differences between successive video object planes are two candi-
tion errors are inevitable, how can we conceal them in our appli- dates for evaluating the temporal stability of object segmentation.
cation?” Our approach is based on the hypothesis that if making Histogram Measure: The difference between two histograms
segmentation errors are inevitable, they should be done in a tem- can be calculated using the chi-square measure as follows [4]:
porally consistent way to increase the viewing comfort in 3D TV B
X [Ht−1 (j) − Ht (j)]2
applications. To this effect, we propose a pseudo-3D curve evolu- dχ2 (Ht−1 , Ht ) = , (1)
tion technique, which distributes the existing segmentation errors j=1
Ht−1 (j) + Ht (j)

0-7803-8554-3/04/$20.00 ©2004 IEEE. 357

Pseudo-3D Generalization of Curve Evolution object planes at frames t and t − 1. where the region to be segmented can be characterized by a predetermined set of distinct ∂Vyt (k) → − → − − → features such as mean. length of the curve weighted by a constant α. we apply the curve evolution vertically by increasing amounts. we first stack them together so that a three-dimensional “object Shape Measure: One way to represent the “shape” of a video blob” in x-y-t space is formed (see Fig. on the “Flikken” sequence (see Fig. x-t and where the parameters u and v denote the mean gray level inten.1. B is the number of bins in the histogram. Background Theory where O n denotes the “object blob” at iteration n and Pyt denotes the processing of each y-t cross-section of the “object blob” using Region-based curve evolution techniques have been used for im. 2 (a). the curve C is evolved in such a way that it will eventually snap to the desired object boundary along the line connecting the vertices V(k) and V(k − 1). → − This idea is illustrated in Fig. The aim is to minimize angle shows the x-t cross section and the vertical rectangle shows the following energy function: I the y-t cross section. both due to the oscillatory (direction chang- equation describing the motion of the curve is obtained as follows: ing) motion of the object.k N k+1. the difference calculation (2) should be repeated x-y domain (at each t value). A prior normalization of the histograms may be Given a set of temporally unstable video object segmentation maps. face is initialized so that it includes this “object blob”. necessary (see [4] for details). t)]. and also due to the PK fact that the evolving surface is represented by polygonal patches j=1 kθt−1 (j) − θt (j)k which leaves out high curvature segments. y) denotes a pixel intensity. This transforms the 3D “object volume” into I(x. yt cross section of the “object blob”. In order sections (slices) of the “object blob” in x-y-t space. → − a polygonal representation for C yt (s. By using this pseudo-3D approach.k−1 N k. nected black blobs still exist after motion compensation because rection parallel to the normal vector drawn to the boundary at that of the natural topology of the object. fek. (3) 2 − C the x-y. the distance between them is calculated smoothing effect is expected both due to the curvature term. which as follows: tries to make the surface as smooth as possible. which are one dimensional vectors describing the shapes ally converge to a smoothed version of the 3D object volume. and if it is ter obtaining the TAFs belonging to the video objects in successive allowed to evolve so as to minimize its energy (3). (5) Au Av black regions in any y-t or x-t cross-section. Af. Let us parameterize tively. not consist of a single connected group of black pixels as can be tion of the energy gradient. and then the minimum of the technique for each x-t and y-t cross section (slice) of the “object differences should be taken. 2 (a)). which may be in.k−1 denote → − → − the interpolated speed function (4) and the outward normal vector trary initialization as denoted by C . The (denoted by θt and θt−1 ). respec- ∂R.k−1 + fek+1. t): age segmentation in the literature [6. a polygonal im. which makes the implementation easier and faster. 1). The over- the curvature κ of the boundary defined at that boundary point. θt ) = (2) This 3D smoothing approach can be converted into a combina- 2πK tion of simpler 2D smoothing steps by considering different cross where K is the total number of points on the boundary. In all flowchart of the proposed pseudo-3D smoothing algorithm is the above equation I(x. = fek. blob” iteratively as follows: 3. y) N − ακ N . t) = [x(s. y) − v a more uniform block. denote areas inside and outside the curve. and texture. 7]. The turning angle function (TAF) plots the counter clockwise its surface using a surface evolution approach. The functions Pxt and Pxy are defined similarly. and has The proposed pseudo-3D temporal stabilization algorithm is tested been adopted and generalized to pseudo-3D in this paper.2. We propose to object is to use the turning angle function of the boundary pixels improve the temporal stability of this “object blob” by smoothing [5]. after shifting one of the turning angle functions horizontally and In order to achieve temporal stability. After some manipulations (see [6]) the observed in Fig. (7) ∂t ferred from the data. If we apply for this function to be independent of rotation and of the choice of the curve evolution equation (4) to the segmentation maps in the the starting point. 2 (b). y-t slices does not produce significant changes in the experimental → − sities inside and outside the curve C and the second term is the results.where Ht and Ht−1 denote the RGB color histograms of the video 3. (4) ing the binary object segmentation maps to align them with respect dt µ ¶ to the first frame. 4. where the horizontal rect- the curve as C (s.k−1 and N k. → − The effect of the motion can be eliminated by motion compensat- d C (s. t) y(s. y) = (u − v) + . If multiple discon- which tells us to move each boundary point on the curve in a di. the curve evolution has to be point using a speed function derived from the image statistics and applied for each disconnected region of significant size. which is an extract from a 358 . we can achieve spatial smoothness. y) − u I(x. until the shape convergences. TEMPORAL STABILIZATION OF OBJECT SEGMENTATION MAPS On+1 = Pyt (Pxt (Pxy (On ))) (6) 3. The order of processing in the x-y. variance. If a polygonal sur- angle from the x-axis as a function of the boundary length [5]. Av given in Fig. Our aim is to move Sometimes the y-t or x-t cross sections of the “object blob” do every point on the curve such that it moves in the negative direc. t) → − → − = f (x. x-t and y-t slices of the “object volume” iteratively. The reader should refer to [6] for details. thus minimizing the number of separate f (x. A simple image segmentation problem is the case where there where Vyt (k) denotes a vertex on the polygonal boundary in the → − are just two types of regions in the image. d(θt−1 .k − ακ N b . we can 1 obtain spatio-temporally stable object segmentation by processing E = − (u − v)2 + α → ds. and Au . EXPERIMENTAL RESULTS plementation of the above curve evolution equation has been pre- sented. 5 (a). Starting with an arbi. and the natural topology of the object. it will eventu- frames. In [7].

Although the accuracy of segmentation objects in the scene and then by placing each object at different in- decreases in several frames after temporal stabilization. . Fig. How. 4 (b). The are considerably smaller after spatio-temporal smoothing. We can observe that Objective Evaluation of the Results: In order to quantify unwanted high-curvature parts and missegmented background re. 7 (b). 4. correctly signal the frames where the missegmented background pixels in the x-y domain which is a large portion of the object has been removed from or added to marked by the horizontal line in Fig. The initial segmentation of the car. this shows that it is possible to increase the quality of using a set-up with glasses. 6. 102. 9. The segmentation of the objects in this realistic se- quence is particularly difficult since the object and background colors are quite similar. as well as the scores for the shape measure. proved the quality of 3D viewing in 3D-TV applications. several frames of the Flikken sequence are shown to see whether the proposed temporal stabilization algorithm im- after applying the complete spatio-temporal smoothing algorithm. 10. 47. the improvement in the temporal stability of the smoothed video gions are eliminated easily. Therefore. 1). which correspond to the legs of the lady. the plot of the histogram difference measure is be seen due to the motion of the lady. Motion compensation is utilized to make ing with the proposed algorithm. Fig. 48. 85. the x-y domain). Subjective (Perceptual) Evaluation of the Results: In order In Fig. (a) The illustration of spatio-temporal smoothing. We can see in Fig. eas. Then. 3]. where large peaks at frame curvature part in the x-t domain corresponds to the elimination of numbers such as 9. which implies mation is added to a given 2D video sequence by segmenting the a better temporal stability.4 (determined experimentally). the mean and the variance of the histogram and shape measures ever. 5 (a). 5 (c) and (b). the overall ferred depths [3]. they are evaluated using the histogram and shape the “lady” object for a fixed y value is shown (after processing in measures. to obtain a reference (R) segmentation. tom row shows the object segmentation maps after convergence of the curve. The gray-shaded regions stacked one after another represent the object segmentation maps in each frame. across 168 frames of the segmentation maps of the “lady” object eral frames in the x-y domain are provided. The results on 168 frames of the “walking lady” ob. 3. (Bottom Row) The results after processing in the x-y domain. The top row shows the before and after x-t smoothing. the histogram dif- segmentation map of frame 111 in the spatial (x-y) domain before ference measure is a good indicator of the instants where we loose and after x-t processing. shows the y-t smoothing results. (b) The flowchart of the spatio- temporal smoothing using curve evolution. The objects in the Flikken sequence were also hand-segmented as explained below. where the temporal unstability caused by the legs is eliminated. 5 (b). Fig. the left and right views are rendered using decrease in segmentation accuracy for 168 frames was marginal a simple first-order extrapolation method for the disoccluded ar- (a 3% increase in the average number of missegmented pixels). we can see that most of the peaks have been eliminated. Processing in the x-y domain: (Top Row) The original segmentation maps for frames 5. indicat- effect of y-t smoothing in the spatial domain is shown in Fig. we also We can see from the bottom row that the smoothed results do not carried out a set of perceptual evaluation tests. after process- and then towards right. the smoothing results for sev. and 110. Then the left and right sequences are displayed to the viewer However. In Fig. the (a) (b) walking lady and the man objects are carried out using the algo- rithm [8. The depth infor- display sudden changes as compared to the top row. observed in the x-y domain. In Fig. In Fig. . TV movie. 3. Fig. We can observe that which actually introduces a loss of segmentation accuracy. 20. ture lines are eliminated. The weight of the curvature term in (3) is selected as α = 0. who first walks towards left given for the temporally stable video object planes. object segmentation without decreasing the segmentation errors. 2. . a y-t cross section is given temporal stability. We can see that some high curva. 4(b) shows the the video object plane (see Fig. this is not noticeable when the scene is viewed in 3D. which were discussed in Section 2. (a) Processing in the x-t domain: The x-t cross-section ject will be presented here. 5 (d). as seen in Fig. for a fixed x value. Two disconnected group of black regions can In Fig. Table 1 summarizes ratio of the mean and variance of the two plots. ing that the segmentation maps are more temporally stable. with which the scenes ob- 359 . The bottom figure shows the result of x-t curve In Fig. an x-t cross-section of object planes. 4 (a) that the elimination of the high number is given for the “lady” object. If we compare the two plots (a) the cross sections more aligned. the plot of the histogram measure versus the frame evolution. (a) (b) Fig. (b) Effects of x-t processing as given temporally unstable object segmentation maps and the bot. 4 (a). 7(a).

Subjective evaluation tests indicate that there is an improvement in the perceived quality of the scene when viewed in Fig.12 0.9 4. The tests where the two compared se. and A. pp. INTER−FRAME VOP HISTOGRAM DIFFERENCES INTER FRAME VOP HISTOGRAM DIFFERENCES AFTER SMOOTHING Mitchell. The se. ICOB’03 Workshop.” Circuits. K.06 segmentation via coupled curve evolution equations. named as Test1 .52 Table 2.69 After smoothing 1. 2003. Chew. A. that if there are inevitable object segmentation errors. Systems and Signal Processing. respectively.RR. Op de Beeck and A.-RS UU. which also validates the effectiveness of the proposed quan- titative measures. (a) (b) (c) Tests 1-2 Tests 3-4 Tests 5-7 Tests 8-9 AB pairs -RU. Kedem. 7. and A.” IEEE Transactions on Image Processing. Arkin. respectively. 5.” IEEE Trans.22 Ratio : Bef ore Af ter 7 182 3. SS) are mentation. where g(. Conf. pp. pp. Huttenlocker. a pseudo-3D region-based curve evolution technique for tem- porally stabilizing a set of estimated video object planes has been introduced.02 0. M. they should bilization.) ter motion compensation. Sankur. The ob. Pattern Analysis and Machine Intelligence.52.US Av. which indicates that S. used for checking the reliability of the tests. Int. 143–183. 7.08 [6] A. The perceptual evaluation results for fourteen observers are no. L. P. Tekalp. (c) The y-t cross-sections after y-t pro. REFERENCES us a total of nine combinations. E. Hence.” in Proc.76 38.52 696. an observer was algorithm which optimizes the temporal stability measures directly shown two stereo sequences A and B one after another. [1] M. [3] F. 1991. quences: A review. 2002. Unal. and K. CONCLUSIONS AND FUTURE WORK x-y domain for frames 49. Lu. summarized in Table 2. ordering of the three cases as: g(R) > g(S) > g(U ).” in Proceedings of IEEE International Conference (a) (b) on Image Processing (ICIP).12 0. 0. Erdem. 20. Top Row: Original video object planes for frames 0. [5] E. (b) Two y-t cross-sections af.SS -SU. 2002. Histogram means and variances have been scaled by 103 and 106 .59 0. 896–899. “Three dimensional video for the server was asked to select one of the choices: “B is significantly home.” in Proc. Zhang and G. denotes the perceived quality of the rendered sequence. 50. vol. “Performance measures an average value of zero. Willsky.02 [7] G.1 13. vol. Fig. During the perceptual tests. The average score of the tests that com. P.83 9. U and S. Bottom Row: The same frames after temporal sta. van Overveld. S.” in Proceedings of VOPs of the “lady” object before (a) and after (b) temporal sta- European Conference on Computer Vision.08 0. Score 1. The experiments support our initial hypothesis 100 and 150. 195–216. 2001. be re-distributed in a temporally stable way. quences A and B can be one of the three cases R.1 0. 2. pp. Ernst. The average scores in Table 2 also indicate a quality the “lady” object for a fixed x value. Redert. Obtaining temporally stable video object segmentation maps is im- portant for comfortable viewing in 3D TV applications. The five options are assigned the scores -2 [2] D.Test9.04 0.Yezzi. 2004. bilization versus frame number. vol. “A fully global approach to image 0. It has been shown by experiments that the proposed algorithm significantly improves the temporal stability in terms of two quantitative objective measures based on histogram and shape differences. no. cessing. RR. Processing in the y-t domain: (a) A y-t cross-section of viewed in 3D.90 36. P. The ratio of the objective evaluation scores for the lady object before and after temporal stabilization. “Segmentation of moving objects in image se- to 2 from left to right. giving 6. 50 and 51. 3D.06 0. M. 209–215. since they should have [4] C. 6. Krim. B. is under development. D.” Journal of Vi- 0. 2002. 0. Ernst. (d) Effects of y-t domain smoothing as observed in the 5. when Fig. 2001. Histogram Measure Shape Measure Mean Var Mean Var Before smoothing 11. “An efficient computable metric for comparing polygonal shapes. The histogram difference measure between successive [8] F.Yezzi.4 Table 1. Subjective evaluation scores for the Flikken sequence.UR SR. “A vertex-based representation of ob- 0 0 20 40 60 80 100 FRAME NUMBER 120 140 160 0 0 20 40 60 80 100 FRAME NUMBER 120 140 160 jects in an image.05 0. “Dense structure-from- motion: An approach based on segment matching. 1. 188–191. out increasing the segmentation accuracy. and J. In this pa- per. vol. we conclude that it is possible to increase the object segmentation quality with- tained by the unstable (U) and stable (S) object segmentation re.04 sual Communication and Image Representation. vol. 360 . Tsai. 13. 0. pp.64 3. for video object segmentation and tracking. B.87 158. the stabilized re- sults are perceived as being better than the unstable results. An object segmentation sults are compared. and A. On Augmented Virtual Environments and worse / slightly worse / the same as / slightly better / significantly Three-Dimensional Imaging. (d) pare S and U is 0. H. Wilinski.08 0. “2d-to-3d video conversion based on time-consistent seg- quences A and B are exactly the same (such as UU. better than sequence A. 13.