
Robotics and Autonomous Systems 55 (2007) 597–607

www.elsevier.com/locate/robot

A variational method for the recovery of dense 3D structure from motion


Hicham Sekkati, Amar Mitiche ∗
Institut national de la recherche scientifique, INRS-EMT, Place Bonaventure, 800, rue de la Gauchetière ouest, Suite 6900, Montreal, Quebec, Canada, H5A 1K6
∗ Corresponding author. E-mail address: mitiche@inrs-telecom.uquebec.ca (A. Mitiche).

Received 3 March 2005; received in revised form 16 November 2006; accepted 16 November 2006
Available online 19 December 2006
doi:10.1016/j.robot.2006.11.006

Abstract

The purpose of this study is to investigate a variational formulation of the problem of three-dimensional (3D) interpretation of temporal
image sequences based on the 3D brightness constraint and anisotropic regularization. The method allows movement of both the viewing
system and objects and does not require the computation of image motion prior to 3D interpretation. Interpretation follows the minimization
of a functional with two terms: a term of conformity of the 3D interpretation to the image sequence first-order spatio-temporal variations, and a
term of regularization based on anisotropic diffusion to preserve the boundaries of interpretation. The Euler–Lagrange partial differential equations
corresponding to the functional are solved efficiently via the half-quadratic algorithm. Results of several experiments on synthetic and real image
sequences are given to demonstrate the validity of the method and its implementation.
© 2007 Published by Elsevier B.V.

Keywords: Image sequence analysis; Optical flow; 3D from 2D; Anisotropic regularization

1. Introduction

The recovery of the shape of real objects from image motion, referred to as structure from motion perception, is a fundamental problem in computer vision. It occurs in many useful applications such as robotics, real object modeling, 2D-to-3D film conversion, augmented reality rendering of visual data, internet and medical imaging, among others.

Computer vision methods which compute image motion before recovering structure are known as two-stage or indirect methods. Those which compute structure without prior image motion estimation are known as direct methods. Direct methods use an explicit model of image motion in terms of the 3D variables to be estimated. For instance, in this study we assume that environmental objects are rigid and we express image motion in terms of the parameters of rigid motion and depth.

One can also make a distinction between dense and sparse recovery of structure from motion. Sparse recovery, where depth is computed at a sparse set of points of the image positional array, has been the subject of numerous well-documented studies [1–5]. Dense recovery, however, where one seeks to compute depth and 3D motion over the whole image positional array, has been significantly less researched in spite of the many studies on dense estimation of image motion [6–8], understandably so, however, because practical applications have appeared latterly.

This study addresses the problem of dense recovery of structure from motion, more precisely the problem of estimating dense maps of depth and 3D motion from a temporal sequence of monocular images. One must differentiate this problem from the problem of estimating depth in stereoscopy ([9,10], for instance). Although one can argue that the two problems are conceptually similar, because one can be considered a discrete version of the other, their input and the processing of this input are dissimilar. As indicated in [11], one can readily see a difference from an abstract point of view, because stereoscopy implies the geometric notion of displacement between views, and image temporal sequences the kinematic notion of motion of the viewing system and viewed objects. A displacement is defined by an initial position and a final position, intermediate positions being immaterial. Consequently, the notions of time and velocity are irrelevant. With motion, in contrast, time and velocity are fundamental dimensions. One can also readily see a difference from a more practical point of view, because both the viewing system and viewed objects can move when acquiring temporal image sequences, and all the motions that occur enter 3D interpretation, which is not the case in stereoscopy. At the least, this leads to computational complications. For instance, assuming that environmental objects are rigid and images are sized 640 × 480, there can be over 2 million variables to evaluate at each instant of interpretation (six for the screw of motion and one for depth, an evaluation at each point of the image positional array). The number of these variables can be reduced if some form of motion-based segmentation enters 3D interpretation [11,12]. Such a process, however, is itself computationally quite intensive.


There are limitations common to all methods that seek a 3D interpretation from a sequence of image spatio-temporal variations:

(a) Because motion-based 3D interpretation requires perception of motion, an obvious limitation is that such an interpretation is not possible when the surfaces to be interpreted are not in motion relative to the viewing system or have non-textured projections, i.e., exhibit no sensible spatio-temporal variations. For such surfaces, depth is indeterminate unless propagated from neighboring textured surfaces. Regardless of how underlying computations manage indeterminacy, schemes that aim to preserve the 3D interpretation boundaries prevent, to some extent, the propagation of interpretation to non-textured surfaces from neighboring textured surfaces.
(b) To any interpretation consistent with a sequence's spatio-temporal variations corresponds a one-parameter family of solutions differing in scale.
(c) Direct implementation of a solution that refers to image motion presumes small-range motion. For large motions, some form of multiresolution processing must enter the interpretation.

Within the general topic of motion analysis, relatively few studies have addressed the subject of dense 3D interpretation of temporal image sequences. Most of these do not allow for object motion, assuming that the viewing system alone is in movement [13–21]. As indicated in [11], when the viewing system moves but environmental objects do not, the problem is simpler and 3D interpretation can be inferred by an indirect method which recovers depth following least-squares estimation of 3D motion. The interpretation of scenes where object movement can occur was investigated in [11,12,22,23].

The variational method in [12] follows the minimum description length principle [24,25]. Environmental objects are assumed to be rigid, and this direct method seeks to partition the image domain into regions of constant depth and 3D motion by minimizing an objective function that measures the cost in bits to code the length of the boundaries of the partition and the conformity of the 3D interpretation to the image sequence spatio-temporal variations within each region of the partition. The necessary conditions for a minimum of the objective function result in a large-scale system of nonlinear equations solved by continuation.

The purpose in [11] was to investigate joint rigid 3D-motion segmentation and 3D interpretation by functional minimization. The functional contained a term of conformity of the 3D interpretation to the image sequence spatio-temporal variations, a term of regularization of depth, and a term of regularity of segmentation boundaries. The Euler–Lagrange equations of minimization were solved by an iterated greedy algorithm which included active curve evolution and level sets.

The purpose of this study is to investigate a new variational method of 3D interpretation of image sequences, introduced briefly in [26], which brings to bear the anisotropic regularization theory developed in [7] for image motion estimation by a variational method. The formal analysis in [7] determined diffusion functions from anisotropic smoothing conditions which preserve image motion boundaries. The analysis also led to the half-quadratic algorithm for solving efficiently the equations of the variational image motion estimation method. Here we present a direct method of 3D interpretation of temporal image sequences based on the minimization of an energy functional containing two characteristic terms: one term evaluates the conformity of a rigidity-based 3D interpretation to the sequence spatio-temporal variations; the other is an anisotropic regularization term on the 3D interpretation using the Aubert function. Minimization of the objective functional follows the Euler–Lagrange equations, solved via the half-quadratic algorithm adapted to these equations. The method allows movement of both the viewing system and environmental objects, as in [11,12]. It generalizes [26] to general rigid motion, relaxes the assumption of piecewise constant 3D variables [12], and is an alternative to level sets [11] of significantly lower computational cost.

The remainder of this paper is organized as follows. In the next section we describe the imaging model and state the equations of the motion field induced by rigid motion. From these, we write the brightness constraint equation to be used subsequently. Section 3 is devoted to recovering the 3D interpretation for the case of a translational 3D motion. Section 4 generalizes the method to general rigid motion, resulting in a model for solving the most general case: arbitrary motion of a viewing system relative to an environment of moving rigid objects. Section 5 explains how relative depth and image motion are recovered from the 3D interpretation, and Section 6 describes several experimental results with both synthetic and real image sequences.

2. Rigidity-based constraint equation of direct methods

The image acquisition process is modeled by central projection through the origin of a direct orthonormal reference system (Fig. 1). Let B be a rigid body in motion with translational velocity T = (t_1, t_2, t_3) and rotational velocity ω = (ω_1, ω_2, ω_3). Let P ∈ B with coordinates (X, Y, Z) and projection p with image coordinates (x, y). The coordinates of P and p are related via the projective relations:

    p = (x, y, f)^t = \left( f\frac{X}{Z},\; f\frac{Y}{Z},\; f \right)^t = f\,\frac{P}{Z}    (1)

where f is the focal length. Differentiation of (1) with respect to time and subsequent substitution of the expression of the velocity Ṗ of P in terms of the translational and rotational velocities of rigid motion, Ṗ = T + ω × OP, gives the expression of image motion (u, v) of p at position (x, y) [27]:
Fig. 1. Camera model: perspective projection.

    u = \frac{1}{Z}(f t_1 - x t_3) - \frac{xy}{f}\,\omega_1 + \frac{f^2 + x^2}{f}\,\omega_2 - y\,\omega_3
    v = \frac{1}{Z}(f t_2 - y t_3) - \frac{f^2 + y^2}{f}\,\omega_1 + \frac{xy}{f}\,\omega_2 + x\,\omega_3    (2)

where Z is the depth of P. Eq. (2) represents a parametric model of image motion in terms of the parameters of rigid motion and depth.

Let I : Ω × ]0, T[ be an image sequence, where Ω is an open set of R^2 representing the image domain, and ]0, T[ is the interval of duration of the sequence. The assumption that the brightness recorded from a point on a surface in the scene does not change during motion leads to the Horn and Schunck gradient equation [6], which is also referred to as the optical flow constraint:

    \frac{dI}{dt} = I_t + \dot{x} I_x + \dot{y} I_y = 0    (3)

where I_x, I_y and I_t are the spatio-temporal derivatives of I. Substitution of (2) in (3) gives the rigidity-based constraint equation of direct methods of 3D interpretation [15]:

    I_t + s \cdot \tau + q \cdot \omega = 0    (4)

where vectors τ, s, and q are given by

    \tau = \frac{T}{Z}, \qquad
    s = \begin{pmatrix} f I_x \\ f I_y \\ -x I_x - y I_y \end{pmatrix}, \qquad
    q = \begin{pmatrix} -f I_y - \frac{y}{f}(x I_x + y I_y) \\ f I_x + \frac{x}{f}(x I_x + y I_y) \\ -y I_x + x I_y \end{pmatrix}.    (5)
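To make the constraint concrete, the coefficient fields s and q of (5) can be assembled directly from the image's spatio-temporal derivatives. The following Python sketch is an illustration, not the authors' implementation; the derivative array names Ix, Iy, It, the centered principal point, and the default f = 600 pixels (the value used in Section 6) are our assumptions:

```python
import numpy as np

def rigidity_constraint_residual(Ix, Iy, It, tau, omega, f=600.0):
    """Residual of the constraint I_t + s.tau + q.omega = 0 (Eq. (4)).

    Ix, Iy, It : spatio-temporal derivative fields, shape (H, W)
    tau        : field T/Z, shape (H, W, 3); omega : (3,) rotation
    f          : focal length in pixels
    """
    H, W = Ix.shape
    y, x = np.mgrid[0:H, 0:W].astype(float)
    x -= W / 2.0   # image coordinates with the principal point
    y -= H / 2.0   # at the array center (an assumption)

    # Eq. (5): coefficient vectors of the translational (s) and
    # rotational (q) parts of the rigidity-based constraint.
    s = np.stack([f * Ix, f * Iy, -x * Ix - y * Iy], axis=-1)
    q = np.stack([-f * Iy - (y / f) * (x * Ix + y * Iy),
                   f * Ix + (x / f) * (x * Ix + y * Iy),
                  -y * Ix + x * Iy], axis=-1)

    return It + (s * tau).sum(-1) + q @ omega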

In the next two sections, we will state 3D interpretation as a variational problem where the energy functional to be minimized contains a term of conformity of the interpretation to constraint (4) and a term of regularization based on an anisotropic smoothing function. We will also write the Euler–Lagrange equations corresponding to the minimization of the energy functional and indicate how to solve them using the half-quadratic algorithm. For a transparent presentation, which would otherwise be unnecessarily cluttered with the symbols of all the 3D motion parameters, we will treat the particular case of translating environmental surfaces (Section 3) before dealing with, by simple generalization, the case of surfaces undergoing general rigid motion (Section 4).
3. 3D interpretation in the case of translational motion

In the case where the imaged environmental surfaces all translate relative to the viewing system, we seek a 3D interpretation of the image sequence that minimizes the following energy functional:

    E(\tau) = \iint_\Omega \left[ (I_t + s \cdot \tau)^2 + \lambda \left( \Phi(\|\nabla\tau_1\|) + \Phi(\|\nabla\tau_2\|) + \Phi(\|\nabla\tau_3\|) \right) \right] dx\,dy.    (6)

In (6) the first term in the integrand is the term of conformity to data according to (4). The other term is a regularization term. Function Φ is defined as in [7] so that it realizes anisotropic diffusion to preserve the boundaries of the interpretation: \Phi(s) = 2\sqrt{1 + s^2} - 2.

The Euler–Lagrange partial differential equations corresponding to (6) are:

    \lambda\, \mathrm{div}\!\left( \frac{\Phi'(\|\nabla\tau_k\|)}{\|\nabla\tau_k\|}\, \nabla\tau_k \right) = 2 s_k (I_t + s \cdot \tau), \qquad k = 1, 2, 3,    (7)

with boundary conditions

    \frac{\partial\tau_1}{\partial n} = 0, \qquad \frac{\partial\tau_2}{\partial n} = 0, \qquad \frac{\partial\tau_3}{\partial n} = 0

where n indicates the unit normal function to the boundary ∂Ω of Ω.
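For completeness, the diffusivity that this choice of Φ induces, and which reappears as the function g in (10) below, follows in one line:

```latex
\Phi(s) = 2\sqrt{1+s^{2}} - 2
\quad\Longrightarrow\quad
\Phi'(s) = \frac{2s}{\sqrt{1+s^{2}}}
\quad\Longrightarrow\quad
\frac{\Phi'(s)}{2s} = \frac{1}{\sqrt{1+s^{2}}} = g(s).
```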
3.1. Energy minimization: Half-quadratic algorithm

A discretization of (7) yields a large system of nonlinear equations that is difficult to solve in general. Rather than solving such a system, we will use the half-quadratic minimization algorithm proposed in [7], where it was used for image restoration and optical flow estimation. With the half-quadratic algorithm, we minimize (6) via the minimization of another functional which, in our case, is given by:

    E^*(\tau, b) = \iint_\Omega \left( (I_t + s \cdot \tau)^2 + \lambda\, C^*(\tau, b) \right) dx\,dy    (8)

where

    C^*(\tau, b) = b_{\tau_1}\|\nabla\tau_1\|^2 + \psi(b_{\tau_1}) + b_{\tau_2}\|\nabla\tau_2\|^2 + \psi(b_{\tau_2}) + b_{\tau_3}\|\nabla\tau_3\|^2 + \psi(b_{\tau_3}),

b = (b_{\tau_1}, b_{\tau_2}, b_{\tau_3})^t is a field of auxiliary variables, and ψ is a strictly decreasing convex function. The change from E to E^* is justified by a duality theorem [7]. Function ψ is related implicitly to function Φ in such a way that, for all real a > 0, the unique value b which minimizes b a^2 + \psi(b) is given by [28]:

    b = \frac{\Phi'(a)}{2a}.    (9)

This result, and the fact that

- for a fixed τ, E^*(\tau, b) is convex in b;
- for a fixed b, E^*(\tau, b) is convex in τ,

are at the core of the half-quadratic algorithm, an iterated two-step greedy minimization algorithm. The first step consists of computing the minimum of E^* with respect to b with τ fixed, followed by finding the minimum of E^* with respect to τ with b fixed. These two steps are repeated until convergence. The half-quadratic algorithm reads as follows.

The first step, which consists of computing b^{(n+1)} = \arg\min_{b} E^*(\tau^{(n)}, b), is equivalent to finding the minimum of

    \iint_\Omega C^*(\tau, b)\, dx\,dy.

According to (9), the value of b which minimizes this functional is computed analytically and is, here, reached for b^F = (b^F_{\tau_1}, b^F_{\tau_2}, b^F_{\tau_3})^t, where:

    b^F_{\tau_k} = \frac{\Phi'(\|\nabla\tau_k\|)}{2\|\nabla\tau_k\|} = \frac{1}{\sqrt{1 + \|\nabla\tau_k\|^2}} = g(\|\nabla\tau_k\|), \qquad k = 1, 2, 3.    (10)

The second step, which computes \tau^{(n+1)} = \arg\min_{\tau} E^*(\tau, b^{(n+1)}), consists of finding the minimum of

    \iint_\Omega \left( (I_t + s \cdot \tau)^2 + \lambda\, C^*(\tau, b^F) \right) dx\,dy.

The Euler–Lagrange equations corresponding to this functional are deduced in the same way as (7) and are given by

    \lambda\, \mathrm{div}( b^F_{\tau_k} \nabla\tau_k ) = s_k (I_t + s \cdot \tau), \qquad k = 1, 2, 3,    (11)

with boundary conditions

    \frac{\partial\tau_1}{\partial n} = 0, \qquad \frac{\partial\tau_2}{\partial n} = 0, \qquad \frac{\partial\tau_3}{\partial n} = 0.

Table 1 summarizes the algorithm.

Table 1
Minimization by the half-quadratic algorithm

    \tau^{(0)} \equiv 0
    Repeat
        b^{(n+1)} = \arg\min_{b} E^*(\tau^{(n)}, b)
        \tau^{(n+1)} = \arg\min_{\tau} E^*(\tau, b^{(n+1)})
        n = n + 1
    Until convergence

Note that, because τ = T/Z, T and Z can be recovered only up to a scale factor, i.e., only relative depth and the direction of translation can be recovered (see Section 5).

3.2. Discretization

We discretize the image domain Ω into a unit-spacing grid D with points indexed by (i, j), i = 1, ..., N, j = 1, ..., M. The divergence terms in (11) can be discretized as in [29]. For each (i, j) ∈ D we have:

    [\mathrm{div}(b^F_{\tau_k} \nabla\tau_k)]_{i,j} \simeq \beta\, [\, b^N_{\tau_k} \nabla_N\tau_k + b^S_{\tau_k} \nabla_S\tau_k + b^E_{\tau_k} \nabla_E\tau_k + b^W_{\tau_k} \nabla_W\tau_k \,]_{i,j}    (12)

where 0 ≤ β ≤ 1/4 is a coefficient for numerical stabilization of the scheme; N, S, E, W are the mnemonic subscripts for north, south, east and west. These subscripts put on the operator ∇ indicate the nearest-neighbor difference in the corresponding direction:

    \nabla_N\tau_k(i, j) \equiv \tau_k(i-1, j) - \tau_k(i, j)
    \nabla_S\tau_k(i, j) \equiv \tau_k(i+1, j) - \tau_k(i, j)    (13)
    \nabla_E\tau_k(i, j) \equiv \tau_k(i, j+1) - \tau_k(i, j)
    \nabla_W\tau_k(i, j) \equiv \tau_k(i, j-1) - \tau_k(i, j).

The coefficients b^N_{\tau_k}, b^S_{\tau_k}, b^E_{\tau_k}, b^W_{\tau_k} can be approximated by the equations

    b^N_{\tau_k}(i, j) \simeq g(|\nabla_N\tau_k(i, j)|)
    b^S_{\tau_k}(i, j) \simeq g(|\nabla_S\tau_k(i, j)|)    (14)
    b^E_{\tau_k}(i, j) \simeq g(|\nabla_E\tau_k(i, j)|)
    b^W_{\tau_k}(i, j) \simeq g(|\nabla_W\tau_k(i, j)|).

The discretized version of (11) is then the system:

    \alpha\, [\, b^N_{\tau_k} \nabla_N\tau_k + b^S_{\tau_k} \nabla_S\tau_k + b^E_{\tau_k} \nabla_E\tau_k + b^W_{\tau_k} \nabla_W\tau_k \,]_{i,j} = [\, s_k (I_t + s \cdot \tau) \,]_{i,j}    (15)

for k = 1, 2, 3, where α = λβ. With this discretization we have, at each (i, j) ∈ D:

    \begin{pmatrix} \alpha\bar{b}_{\tau_1} + s_1^2 & s_1 s_2 & s_1 s_3 \\ s_1 s_2 & \alpha\bar{b}_{\tau_2} + s_2^2 & s_2 s_3 \\ s_1 s_3 & s_2 s_3 & \alpha\bar{b}_{\tau_3} + s_3^2 \end{pmatrix} \begin{pmatrix} \tau_1 \\ \tau_2 \\ \tau_3 \end{pmatrix} = \begin{pmatrix} \alpha\bar{\tau}_1 - s_1 I_t \\ \alpha\bar{\tau}_2 - s_2 I_t \\ \alpha\bar{\tau}_3 - s_3 I_t \end{pmatrix}    (16)

where the coefficients \bar{\tau}_k and \bar{b}_{\tau_k} are given by

    \bar{\tau}_k(i, j) = b^N_{\tau_k}(i, j)\,\tau_k(i-1, j) + b^S_{\tau_k}(i, j)\,\tau_k(i+1, j) + b^E_{\tau_k}(i, j)\,\tau_k(i, j+1) + b^W_{\tau_k}(i, j)\,\tau_k(i, j-1)    (17)
    \bar{b}_{\tau_k}(i, j) = b^N_{\tau_k}(i, j) + b^S_{\tau_k}(i, j) + b^E_{\tau_k}(i, j) + b^W_{\tau_k}(i, j).
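As an illustration of (13), (14) and (17), the following Python sketch computes the neighbor differences, the directional weights, and the weighted sums for one component τ_k; the replicated-border handling of the Neumann boundary condition is our implementation choice, not the paper's:

```python
import numpy as np

def g(a):
    """Diffusivity of Eq. (10): g(a) = Phi'(a)/(2a) = 1/sqrt(1 + a^2)."""
    return 1.0 / np.sqrt(1.0 + a * a)

def neighbor_diffs(tk):
    """Nearest-neighbor differences of Eq. (13) for one component tau_k.
    Borders are replicated, standing in for the Neumann condition."""
    p = np.pad(tk, 1, mode="edge")
    dN = p[:-2, 1:-1] - tk    # tau_k(i-1, j) - tau_k(i, j)
    dS = p[2:,  1:-1] - tk    # tau_k(i+1, j) - tau_k(i, j)
    dE = p[1:-1, 2:]  - tk    # tau_k(i, j+1) - tau_k(i, j)
    dW = p[1:-1, :-2] - tk    # tau_k(i, j-1) - tau_k(i, j)
    return dN, dS, dE, dW

def bars(tk):
    """Weights (14) and the weighted sums tau_bar, b_bar of Eq. (17)."""
    diffs = neighbor_diffs(tk)
    b = [g(np.abs(d)) for d in diffs]
    tau_bar = sum(bi * (tk + di) for bi, di in zip(b, diffs))
    b_bar = sum(b)
    return tau_bar, b_bar
```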
Coefficients \bar{\tau}_k and \bar{b}_{\tau_k} are computed using the neighborhood of (i, j). Solving (16) gives, at each (i, j) ∈ D,

    \tau_k = \frac{1}{\bar{b}_{\tau_k}} \left( \bar{\tau}_k - s_k\, q \right), \qquad k = 1, 2, 3,    (18)

where

    q = \frac{ s_1 \bar{b}_{\tau_2} \bar{b}_{\tau_3} \bar{\tau}_1 + s_2 \bar{b}_{\tau_1} \bar{b}_{\tau_3} \bar{\tau}_2 + s_3 \bar{b}_{\tau_1} \bar{b}_{\tau_2} \bar{\tau}_3 + \bar{b}_{\tau_1} \bar{b}_{\tau_2} \bar{b}_{\tau_3} I_t }{ \alpha \bar{b}_{\tau_1} \bar{b}_{\tau_2} \bar{b}_{\tau_3} + s_1^2 \bar{b}_{\tau_2} \bar{b}_{\tau_3} + s_2^2 \bar{b}_{\tau_1} \bar{b}_{\tau_3} + s_3^2 \bar{b}_{\tau_1} \bar{b}_{\tau_2} }.

When (18) is written for all (i, j) ∈ D, we obtain a large sparse system of linear equations which can be solved efficiently using the Gauss–Seidel algorithm [30].
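Putting Table 1 together with (14), (17) and (18), the complete update loop for the translational case can be sketched as follows in Python, reusing g and neighbor_diffs from the sketch above. This is an illustration under the paper's equations, not the authors' code; vectorized Jacobi-style sweeps replace the pointwise Gauss–Seidel ordering:

```python
import numpy as np

def interpret_translational(s, It, alpha, n_outer=50, n_sweeps=10):
    """Half-quadratic minimization of (8), following Table 1.
    s : coefficient field of Eq. (5), shape (H, W, 3)
    It: temporal derivative field, shape (H, W)
    alpha = lambda * beta, as in Eq. (15)."""
    H, W, _ = s.shape
    tau = np.zeros((H, W, 3))                    # tau^(0) = 0 (Table 1)
    s1, s2, s3 = s[..., 0], s[..., 1], s[..., 2]
    for _ in range(n_outer):
        # Step 1: analytic update of the auxiliary field b, Eqs. (10)/(14).
        b = [[g(np.abs(d)) for d in neighbor_diffs(tau[..., k])]
             for k in range(3)]
        bb = np.stack([sum(bk) for bk in b], -1)   # b_bar of Eq. (17)
        # Step 2: minimization over tau with b fixed, by sweeps of the
        # closed-form per-pixel solution (18) of the linear system (16).
        for _ in range(n_sweeps):
            tb = np.stack([sum(bi * (tau[..., k] + di)
                               for bi, di in zip(b[k],
                                                 neighbor_diffs(tau[..., k])))
                           for k in range(3)], -1)  # tau_bar of Eq. (17)
            B1, B2, B3 = bb[..., 0], bb[..., 1], bb[..., 2]
            q = ((s1 * B2 * B3 * tb[..., 0] + s2 * B1 * B3 * tb[..., 1]
                  + s3 * B1 * B2 * tb[..., 2] + B1 * B2 * B3 * It)
                 / (alpha * B1 * B2 * B3 + s1 ** 2 * B2 * B3
                    + s2 ** 2 * B1 * B3 + s3 ** 2 * B1 * B2))
            for k in range(3):
                tau[..., k] = (tb[..., k] - s[..., k] * q) / bb[..., k]
    return tau
```

Each outer iteration alternates the two convex minimizations of Table 1; the inner sweeps propagate the smoothing through the neighborhood sums of (17).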
4. 3D interpretation for general rigid motion

A functional similar to (6) can be written for general rigid motion, i.e., a composition of instantaneous translation and rotation. The extension is straightforward. We make a notation change so that the formulation has a compact form. We construct the two six-dimensional (6D) vectors r and ρ via the concatenations r = [s, q] and ρ = [τ, ω]. Then Eq. (4) becomes:

    I_t + r \cdot \rho = 0.    (19)

The energy functional to be minimized becomes:

    E(\rho) = \iint_\Omega \left[ (I_t + r \cdot \rho)^2 + \lambda \sum_{i=1}^{6} \Phi(\|\nabla\rho_i\|) \right] dx\,dy    (20)

where the positive real constant λ controls the degree of diffusion across boundaries in ρ. Here, again, we consider instead of (20) the minimization of the following energy functional defined through the auxiliary 6D field b = \{b_{\rho_i}\}_{i=1}^{6}:

    E^*(\rho, b) = \iint_\Omega \left( (I_t + r \cdot \rho)^2 + \lambda\, C^*(\rho, b) \right) dx\,dy    (21)

where

    C^*(\rho, b) = \sum_{i=1}^{6} \left( b_{\rho_i} \|\nabla\rho_i\|^2 + \psi(b_{\rho_i}) \right).

At each pixel (i, j) of the discrete grid D, we have the following equation between the value of ρ at (i, j) and the values of ρ and of the image data at points in the neighborhood of (i, j):

    B\rho = c.    (22)

The elements of the 6 × 6 matrix B and the 6 × 1 vector c are given by:

    B_{kl} = \delta_{kl}\, \alpha \bar{b}_{\rho_k} + r_k r_l, \qquad c_k = \alpha \bar{\rho}_k - r_k I_t,    (23)

where we have set α = λβ, \delta_{kl} is the Kronecker symbol, and \bar{\rho}_k and \bar{b}_{\rho_k} have the same expressions as in Eq. (17). The resolution of the linear system (22), for each pixel, is performed by singular value decomposition rather than explicitly as in (18) for the case of translational motion. We then proceed with the half-quadratic minimization in a manner identical to the case of translational motion. The only difference here is that we have a vector of unknown variables of higher dimension (six instead of three).
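As an illustration of (22) and (23), the per-pixel solve for the six unknowns can be sketched as follows; argument names are ours, and NumPy's least-squares routine is SVD-based, in line with the resolution by singular value decomposition mentioned above:

```python
import numpy as np

def solve_general_pixel(r, It, rho_bar, b_bar, alpha):
    """Per-pixel solve of B rho = c (Eq. (22)) for general rigid motion;
    a sketch of Eqs. (22)-(23), not the authors' code.  All arguments are
    values at one pixel: r (6,) is [s, q] as in (19); rho_bar and b_bar
    (6,) are the weighted neighborhood sums analogous to Eq. (17)."""
    B = np.diag(alpha * b_bar) + np.outer(r, r)   # B_kl of Eq. (23)
    c = alpha * rho_bar - r * It                  # c_k  of Eq. (23)
    # lstsq computes the minimum-norm least-squares solution via SVD,
    # which tolerates the near-singular B arising in flat image regions.
    rho, *_ = np.linalg.lstsq(B, c, rcond=None)
    return rho
```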
5. Recovery of relative depth and image motion

Following convergence of the half-quadratic algorithm, relative depth, as well as image motion, are recovered directly from the six components of ρ (see Section 4 for notation). As mentioned earlier, the translation and depth can be recovered only up to a scale factor. When T ≠ 0, the scale can be fixed by imposing the constraint \|T\| = 1. Depth is then given by

    \frac{1}{Z} = \sqrt{ \rho_1^2 + \rho_2^2 + \rho_3^2 }    (24)

and image motion by

    u = f\rho_1 - x\rho_3 - \frac{xy}{f}\,\rho_4 + \frac{f^2 + x^2}{f}\,\rho_5 - y\,\rho_6
    v = f\rho_2 - y\rho_3 - \frac{f^2 + y^2}{f}\,\rho_4 + \frac{xy}{f}\,\rho_5 + x\,\rho_6.    (25)
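The recovery step is a direct evaluation of (24) and (25); a minimal Python sketch, with x and y denoting the centered image-coordinate grids (our convention):

```python
import numpy as np

def depth_and_flow(rho, x, y, f=600.0):
    """Relative depth (24) and image motion (25) from the 6D field
    rho = [tau, omega], shape (H, W, 6).  A sketch, assuming the scale
    has been fixed by ||T|| = 1."""
    r1, r2, r3, r4, r5, r6 = (rho[..., k] for k in range(6))
    inv_Z = np.sqrt(r1**2 + r2**2 + r3**2)              # Eq. (24): 1/Z
    u = f*r1 - x*r3 - (x*y/f)*r4 + ((f**2 + x**2)/f)*r5 - y*r6
    v = f*r2 - y*r3 - ((f**2 + y**2)/f)*r4 + (x*y/f)*r5 + x*r6
    return inv_Z, u, v
```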
algorithm in [31] (courtesy of Eric Dubois). Included in the
Bρ = c. (22) results is the optical flow reconstructed from the algorithm’s
output (Eq. (1)), compared to the optical flow computed directly
The elements of the 6 × 6 matrix B and the 6 × 1 vector c
by the Horn and Schunck method, and to ground truth for the
are given by:
examples for which it is available. For proper viewing, optical
Bkl = δkl αbρk + rk rl flow is scaled before display.
(23) In each experiment we used two consecutive frames. The 3D
ck = αρ k − rk It
interpretation field (τ ) is initialized to zero. The coefficient λ
where we have set α = λβ, δkl is the Kronecker symbol, and which weighs the functional regularization term is determined
ρ k and bρk have the same expressions as in Eq. (17). The experimentally by viewing the estimated structure both as a
Fig. 2. (a) A frame of the Marbled-block sequence; (b) ground-truth image motion.

Fig. 3. (a) Recovered optical flow from Eq. (25); (b) image motion computed with the Horn and Schunck method (for comparison); (c) reconstructed depth visualised as a gray-level map.

The algorithm is terminated when the computed variables cease to evolve significantly. Equations of interpretation, Eq. (7), for instance, require the camera focal length. They also require the interpixel distance, since image coordinates appear in these. If distance is measured in pixels, in which case the interpixel distance is one pixel, then the focal length is measured in pixels. We used f = 600 pixels. This corresponds to the focal length and interpixel distance of common cameras (8.5 mm and about 0.015 mm, respectively). Exact knowledge of these camera parameters is not critical.
Fig. 4. (a) A frame of the teddy4 sequence; (b) reconstructed depth visualised as a gray-level map; (c) recovered optical flow from Eq. (25); (d) image motion computed with the Horn and Schunck method (for comparison).

For instance, a zoom value of f = 1000 pixels leads to the same perceived structure. This is probably due to the fact that the computed structure is regularized by smoothness. However, because the computed structure results from functional minimization, it is not clear, analytically, how sensitive the algorithm is to errors in the camera calibration parameters.

The first example uses the Marbled-block sequence, which is a synthetic sequence from the image database of the KOGS/IAKS Laboratory, Germany. The first of the two frames used is shown in Fig. 2(a). This sequence consists of images of two moving blocks in an otherwise static environment. Each block moves with a distinct translation. The larger block moves in depth and to the left, the other forward and also to the left. Ground-truth image motion is shown in Fig. 2(b). Several aspects of this sequence make its 3D reconstruction challenging. The blocks cast shadows which cause a weak apparent boundary movement. The blocks are covered by a macro-texture, the interior of the textons having weak spatio-temporal intensity variations. The texture of the top of the blocks is identical to that of the background, causing two of the image-occluding edges to be very weak, barely noticeable in some places. The light source is positioned on the left. This causes the right-most face of the blocks to be shadowed. As a result, the intensity variations on these faces are significantly weakened. Finally, depth presents sharp discontinuities at occlusion boundaries which need to be preserved.

Reconstructed depth, computed from the 3D interpretation by Eq. (24), is shown in Fig. 3(c) as a gray-level map. The reconstructed image motion (Eq. (2)) is displayed in Fig. 3(a). Note that anisotropic regularization has preserved the occluding edges of each block as well as the edges between their visible facets. Fig. 9 displays an anaglyph of a stereoscopic image constructed from the first of the two images of the experiment and the estimated depth (see Appendix). The anaglyph image is to be viewed with chromatic (red–cyan) glasses (there are common, inexpensive, commercial plastic glasses). Presented with this anaglyph, viewers experienced a strong sense of depth.

The second example uses the teddy4 sequence from the image database of Carnegie Mellon University. This real sequence consists of images recorded by a camera in motion to the right, with the bear moving to the left against a textured background. Consequently, both the object and the background are in (distinct) motion relative to the viewing system. The first of the two frames that were used is displayed in Fig. 4(a).
Fig. 5. (a) A frame of the Ambler1b sequence; (b) recovered optical flow from Eq. (25); (c) image motion computed with the Horn and Schunck method (for comparison).

Fig. 6. Recovered depth displayed as a gray-level map with anisotropic smoothness factor λ = (a) 10^8, and (b) 10^9.

This sequence also presents characteristics that make its 3D interpretation challenging. There are large regions with no or weak texture on the image of the toy. Regions of predominantly uniform intensity carry little or no 3D information. At the same time, there are small but perceptually important details (eyes, nose tip, bow tie). These small regions have strong intensity boundaries but no or weak texture inside. The reconstructed depth and image motion are shown, respectively, in Fig. 4(b) and (c). The image motion computed by the Horn and Schunck algorithm is shown in Fig. 4(d). Fig. 10 displays an anaglyph for this example.
Fig. 7. (a) A frame of the Flowergarden sequence; (b) recovered optical flow from Eq. (25); (c) image motion computed with the Horn and Schunck method (for comparison).

Fig. 8. (a) Reconstructed depth visualised as a gray-level map. (b) Reconstructed depth after smoothing the input images with a Gaussian filter.

The third example uses the Ambler1b sequence (also from the image database of Carnegie Mellon University). Camera motion is nearly horizontal when the two frames used here were recorded. Fig. 5(a) shows the first of the two frames. The scene contains a box placed on a nearly horizontal surface of sand. There are large regions of weak and fine textures (both on the sand surface and on the block facets). The scene is illuminated by two light sources which cause shadows around the block. The recovered image motion and the motion estimated with the Horn and Schunck method are displayed in Fig. 5(b) and (c), respectively. The reconstructed depth is displayed in Fig. 6(a) and (b) for two values of λ, the smoothness factor of anisotropy. With the larger λ, Fig. 6(a), we obtain a smoother depth for the sand but some details, such as some of the box edges, are weakened. The anaglyph for this example is shown in Fig. 11.

This last example uses the Flowergarden sequence, which contains images of a real-world scene. The variations in depth here are richer and more complex than in the preceding examples, as depth varies from shallow (tree) to large (houses) and infinite (sky). Also in this sequence, there are large regions of very weak texture (sky, houses) and of fine texture (tree foliage). Also, the tree trunk has small regions of nearly constant gray level. The sequence was acquired by a camera on a vehicle driving by the scene. Fig. 7(a) shows the first of the two frames. Image motion recovered according to the scheme and to the Horn and Schunck algorithm are shown respectively in Fig. 7(b) and (c). Reconstructed depth is shown in Fig. 8(a) as a gray-level map. The spots of nearly constant gray level on the tree trunk are interpreted as being very distant. This is due to anisotropy which, as mentioned in the introduction (remark (a)), is aimed at preserving boundaries and, therefore, prevents the propagation of interpretation from textured to untextured regions of the image.
Fig. 9. Anaglyph image reconstructed for the Marbled-block sequence.

Fig. 10. Anaglyph image reconstructed for the teddy4 sequence.

Fig. 11. Anaglyph image reconstructed for the Ambler1b sequence.

Fig. 12. Anaglyph image reconstructed for the Flowergarden sequence.

That this is so is confirmed by the results in Fig. 8(b), obtained by prior Gaussian smoothing of the input images. In this case, depth is smoother on the tree trunk but fine detail in depth is weakened in some places (foreground tree branches, for instance). Fig. 12 displays an anaglyph for this example.

7. Conclusion

We presented a direct method for dense 3D interpretation of temporal image sequences. This variational method is based on the 3D brightness constraint and anisotropic regularization to preserve the boundaries of the 3D interpretation. Promising results have been obtained in several experiments with synthetic and real image sequences. Because the 3D interpretation of textureless image segments cannot be resolved on the basis of the motion of these segments, we envisage incorporating a method for the propagation of interpretation from textured to untextured image segments. We are also considering a generalization which would take full account of the temporal dimension by seeking an interpretation over the length of an image sequence.

Acknowledgment

This research was supported in part by the Natural Sciences and Engineering Research Council (NSERC) under grant OGP0004234.

Appendix. Construction of a stereoscopic image from an image and its depth map

Given an image and the corresponding depth map, we can construct a stereoscopic image using a simple scheme. Let I_1 be the given image. I_1 will be one of the two images of the stereoscopic pair and we construct the other image, I_2. Let S_1 be the viewing system representing the camera that acquired I_1, and S_2 that of the other (fictitious) camera. Both viewing systems are as in Fig. 1. S_2 is placed to differ from S_1 by a translation of amount d along the X-axis.

Let (x_2, y_2) be a point on the image positional array of I_2, corresponding to a point P in space with coordinates (X_2, Y_2, Z_2) in S_2. The coordinates of P in S_1 are X_1 = X_2 + d, Y_1 = Y_2, and Z_1 = Z_2.
The image coordinates of P in the image domain of I_1 are, according to our viewing system model (Fig. 1):

    x_1 = f\,\frac{X_1}{Z_1}    (26)
    y_1 = y_2.    (27)

Because depth has been estimated, coordinates (x_1, y_1) are known. Image I_2, which will be the second image of the stereoscopic pair, is then constructed as follows:

    I_2(x_2, y_2) = I_1(\tilde{x}_1, y_1)    (28)

where \tilde{x}_1 is the x-coordinate of the point on the image positional array of I_1 closest to x_1. Alternatively, one can use interpolation. However, we found it unnecessary for our purpose here.
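The scheme of Eqs. (26)–(28) amounts to a depth-dependent horizontal resampling of I_1. The following Python sketch makes it concrete; the centered-coordinate convention and the handling of out-of-range samples are our assumptions:

```python
import numpy as np

def second_view(I1, inv_Z, d, f=600.0):
    """Second image of a stereoscopic pair from I1 and its estimated
    relative depth, following Eqs. (26)-(28).  The fictitious camera is
    translated by d along the X-axis, so each pixel of I2 is fetched
    from I1 at the nearest x-coordinate x1 = x2 + f*d/Z (no
    interpolation, as in the paper).  A sketch, not the authors' code."""
    H, W = I1.shape
    I2 = np.zeros_like(I1)
    for i in range(H):                        # y1 = y2, Eq. (27)
        for j in range(W):
            x2 = j - W / 2.0                  # centered image coordinate
            x1 = x2 + f * d * inv_Z[i, j]     # Eq. (26): x1 = f X1 / Z1
            jj = int(round(x1 + W / 2.0))     # nearest sample, Eq. (28)
            if 0 <= jj < W:
                I2[i, j] = I1[i, jj]
    return I2
```

The pair (I1, I2) can then be combined into a red–cyan anaglyph, for which the authors use the projection method of [31].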
References

[1] J. Aggarwal, N. Nandhakumar, On the computation of motion from a sequence of images: A review, Proceedings of the IEEE 76 (8) (1988) 917–935.
[2] T. Huang, A. Netravali, Motion and structure from feature correspondences: A review, Proceedings of the IEEE 82 (2) (1994) 252–268.
[3] O. Faugeras, Three Dimensional Computer Vision: A Geometric Viewpoint, MIT Press, Cambridge, MA, 1993.
[4] A. Mitiche, Computational Analysis of Visual Motion, Plenum Press, New York, 1994.
[5] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, Cambridge, 2000.
[6] B. Horn, B. Schunck, Determining optical flow, Artificial Intelligence 17 (1–3) (1981) 185–203.
[7] G. Aubert, R. Deriche, P. Kornprobst, Computing optical flow via variational techniques, SIAM Journal of Applied Mathematics 60 (1) (1999) 156–182.
[8] A. Mitiche, P. Bouthemy, Computation and analysis of image motion: A synopsis of current problems and methods, International Journal of Computer Vision 19 (1) (1996) 29–55.
[9] J. Mellor, S. Teller, T. Lozano-Perez, Dense depth maps from epipolar images, in: DARPA Image Understanding Workshop, 1997, pp. 893–900.
[10] Z. Marco, J. Victor, H. Christensen, Constrained structure and motion estimation from optical flow, in: International Conference on Pattern Recognition, vol. 1, Quebec, 2002, pp. 339–342.
[11] H. Sekkati, A. Mitiche, Joint dense 3D interpretation and multiple motion segmentation of temporal image sequences: A variational framework with active curve evolution and level sets, in: International Conference on Image Processing, Singapore, 2004, pp. 549–552.
[12] A. Mitiche, S. Hadjres, MDL estimation of a dense map of relative depth and 3D motion from a temporal sequence of images, Pattern Analysis and Applications 6 (2003) 78–87.
[13] G. Adiv, Determining three-dimensional motion and structure from optical flow generated by several moving objects, IEEE Transactions on Pattern Analysis and Machine Intelligence 7 (4) (1985) 384–401.
[14] S. Negahdaripour, B. Horn, Direct passive navigation, IEEE Transactions on Pattern Analysis and Machine Intelligence 9 (1) (1987) 168–176.
[15] B. Horn, E. Weldon, Direct methods for recovering motion, International Journal of Computer Vision 2 (2) (1988) 51–76.
[16] R. Laganiere, A. Mitiche, Direct Bayesian interpretation of visual motion, Journal of Robotics and Autonomous Systems 14 (4) (1995) 247–254.
[17] R. Chellappa, S. Srinivasan, Structure from motion: Sparse versus dense correspondence methods, in: International Conference on Image Processing, vol. 2, Kobe, Japan, 1999, pp. 492–499.
[18] Y. Hung, H. Ho, A Kalman filter approach to direct depth estimation incorporating surface structure, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (6) (1999) 570–575.
[19] G. Stein, A. Shashua, Model-based brightness constraints: On direct estimation of structure and motion, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (9) (2000) 993–1005.
[20] S. Srinivasan, Extracting structure from optical flow using the fast error search technique, International Journal of Computer Vision 37 (3) (2000) 203–230.
[21] T. Brodsky, C. Fermuller, Y. Aloimonos, Structure from motion: Beyond the epipolar constraint, International Journal of Computer Vision 37 (3) (2000) 231–258.
[22] F. Martinez, J. Benois-Pineau, D. Barba, Extraction of the relative depth information of objects in video sequences, in: International Conference on Image Processing, Chicago, IL, 1998, pp. 948–952.
[23] F. Morier, H. Nicolas, J. Benois-Pineau, D. Barba, H. Sanson, Relative depth estimation of video objects for image interpolation, in: International Conference on Image Processing, 1998, pp. 953–957.
[24] J. Rissanen, Modeling by shortest data description, Automatica 14 (1978) 465–471.
[25] Y. Leclerc, Constructing simple stable descriptions for image partitioning, International Journal of Computer Vision 3 (1) (1989) 73–102.
[26] H. Sekkati, A. Mitiche, Dense 3D interpretation of image sequences: A variational approach using anisotropic diffusion, in: International Conference on Image Analysis and Processing, Mantova, Italy, 2003.
[27] H. Longuet-Higgins, K. Prazdny, The interpretation of a moving retinal image, Proceedings of the Royal Society of London, Series B 208 (1981) 385–397.
[28] D. Geman, G. Reynolds, Constrained restoration and the recovery of discontinuities, IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (3) (1992) 367–383.
[29] P. Perona, J. Malik, Scale-space and edge detection using anisotropic diffusion, IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (7) (1990) 629–639.
[30] P. Ciarlet, Introduction à l'analyse numérique matricielle et à l'optimisation, fifth ed., Masson, Paris, 1994.
[31] E. Dubois, A projection method to generate anaglyph stereo images, in: Proc. International Conference on Acoustics, Speech, and Signal Processing, vol. III, Salt Lake City, USA, 2001, pp. 1661–1664.

Dr. Hicham Sekkati received the M.Sc. and Ph.D. degrees in 2003 and 2005 from the Institut National de la Recherche Scientifique (INRS-EMT), Montreal, Canada. The subject of his Ph.D. thesis was variational segmentation and 3D interpretation of monocular image sequences. In 2005, he joined the Computer Vision Laboratory at the University of Miami, FL, as a postdoctoral associate, pursuing research on opti-acoustic image processing for underwater applications.

Prof. Amar Mitiche holds a Licence ès Sciences in mathematics from the University of Algiers and a Ph.D. in computer science from the University of Texas at Austin. He is currently a professor at the Institut National de la Recherche Scientifique (INRS), department of telecommunications (INRS-EMT), in Montreal, Quebec, Canada. His research is in computer vision. His current interests include image segmentation, motion analysis in monocular and stereoscopic image sequences (detection, estimation, segmentation, tracking, 3D interpretation) with a focus on level set methods, and written text recognition with a focus on neural network methods.
