Virtual Space Teleconferencing Using a Sea of Cameras
Henry Fuchs, Gary Bishop, Kevin Arthur, Leonard McMillan
University of North Carolina at Chapel Hill
Ruzena Bajcsy, Sang Wook Lee, Hany Farid
University of Pennsylvania
Takeo Kanade
Carnegie Mellon University
1 Introduction
In the near future, immersive stereo displays, three-dimensional sound, and tactile feedback will be increasingly capable of providing a sensation of presence in a virtual environment [Sutherland, 1968; Bishop et al., 1992]. When this technology is applied to long-range communication, the goal is to provide a sense of telepresence to the participant.
The true promise of telepresence lies in its potential for enhanced interactivity and increased flexibility when compared to alternative technologies, such as conventional teleconferencing. Telepresence should not merely allow the viewing of remote participants; it should also allow us to participate in the same space with them.

Figure 1: Concept sketch of a medical tele-consultation scenario.

Author addresses: Department of Computer Science, Chapel Hill, NC 27599-3175; e-mail: {fuchs, gb, arthur, mcmillan}@cs.unc.edu. GRASP Laboratory, Department of Computer and Information Science, Philadelphia, PA 19104-6228; e-mail: {bajcsy, swlee, farid}@grip.cis.upenn.edu. The Robotics Institute, 5000 Forbes Avenue, Pittsburgh, PA 15213-3891; e-mail: tk@cs.cmu.edu.

2 Our Approach

We want to achieve a telepresence capability with

- natural, intuitive interaction for each of the participants, with little or no training requirements,
- non-intrusive system sensors and displays,
- independent control of gaze for each participant,
- support for multiple independent participants at each site and multiple (2 or more) sites.
These considerations lead us to systems in which each participant wears a head-mounted display to look around a remote environment whose surface geometries are continuously sensed by a multitude of video cameras mounted along the walls and ceiling, from which depth maps are extracted through cross-correlation stereo techniques [Fuchs and Neumann, 1993].

Views acquired from several cameras can then be processed and displayed on a head-mounted display with an integrated tracking system to provide images of the remote environment. In this model the cameras perform two functions: they individually provide photometric images, or textures, from different viewpoints within the scene, and, in combination, they are used to extract depth information.

The extraction of depth information from a set of two or more images is a well-known problem in robotic vision [Barnard and Fischler, 1982; Dhond and Aggarwal, 1989]. In the traditional application, one or two cameras mounted on a mobile platform are used to acquire depth information at each pixel within the overlapping region of the sensor arrays as the robot moves through the environment. This same basic approach has also been applied in computer graphics as a technique for static model generation [Koch, 1993]. Our approach differs in that a multitude of permanently mounted cameras are used to acquire dynamic, or more accurately throw-away, models of a scene.

In contrast to other applications of computer vision, for telepresence we desire precise and dense depth information in order to provide high visual fidelity in the resulting display; we are not concerned with recognizing or constructing models of the objects in the scene, or with gathering information for navigation and collision avoidance for autonomous vehicles.

We feel that the passive "sea-of-cameras" approach to telepresence has a number of advantages over other methods, and is very nearly feasible with current or soon-to-be-available technology. One advantage of the approach is that we need not generate a model for objects that are not seen at a given time, because we can generate the scene description from the point of view of the remote camera whose line of sight most nearly matches that of the participant.

3 Previous Work

Previous approaches to telepresence tend to fall into one of the following categories: (1) a remote system provides incremental updates to a locally maintained model [Caudell et al., 1993; Ohya et al., 1993; Terzopoulos and Waters, 1993], (2) dynamic textures are mapped onto an essentially static model [Hirose et al., 1993], (3) images from a multicamera conference room are projected onto a large field-of-view display, and (4) a boom-mounted stereo camera pair is controlled by the movement of a remote observer wearing a head-mounted display.

The first two approaches assume that the remote geometry is largely static or constrained to move along predefined paths. Both approaches require a high-level understanding of the scene's composition, and they require that the scene be broken down into its constituent parts. For instance, each individual person and object in the scene must be modeled. The use of these techniques has been practically limited to human models and to simple objects and environments that have been previously digitized.

The third and fourth approaches allow for both dynamic and unconstrained remote environments. A multicamera wide-angle teleconference allows for limited gaze redirection, but it does not provide strong depth cues, nor does it allow for freedom of movement within the environment. We believe that the types and ranges of interaction are greatly reduced by these limitations. Boom-mounted cameras, on the other hand, are perhaps the closest approximation to an ideal telepresence set-up. They provide for both stereoscopic and motion parallax depth cues and require little auxiliary processing of the acquired images. The disadvantages of using boom-mounted cameras are related to mechanical limitations of the system: the boom should have nearly the same range of motion as an actual observer, and the motion of the boom-mounted cameras is intrusive and perhaps even dangerous for the applications we envision. In our proposed approach, there is no remote camera positioning system that attempts to mimic the local participant's head movements. In effect, there are only "virtual cameras" that correspond to the participant's eye positions in the environment.

4 Depth Acquisition

4.1 Overview

The major steps in recovering depth information from a pair or sequence of images are: (1) preprocessing, (2) matching, and (3) recovering depth (see [Dhond and Aggarwal, 1989] for a review of stereo algorithms). The preprocessing stage generally consists of a rectification step that accounts for lens distortion and non-parallel axis camera geometry [Tsai, 1987; Weng et al., 1992b]. The process of matching is the most important and difficult stage in most stereo algorithms. The matching process determines correspondence between "features" that are projections of the same physical entity in each view. Matching strategies may be categorized by the primitives used for matching (e.g. features or intensity) and the imaging geometry (e.g. parallel or non-parallel optical axis). Once the correspondence between views is established, depth can be recovered by triangulation.
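As a concrete illustration of steps (2) and (3), here is a minimal sketch for parallel-axis cameras, where each epi-polar line reduces to a horizontal scanline. This is not the paper's algorithm: the window size, search range, focal length, and baseline below are illustrative assumptions, and a simple sum-of-absolute-differences cost stands in for the correlation measures surveyed above.

```python
import numpy as np

def sad(a, b):
    # Matching cost: sum of absolute differences of intensities
    # over a sampling window.
    return np.sum(np.abs(a - b))

def match_scanline(left, right, row, x, half, max_disp):
    """Find the disparity of pixel (row, x) of `left` by searching
    along the corresponding scanline of `right` (parallel-axis
    geometry, so the epi-polar line is horizontal)."""
    ref = left[row - half:row + half + 1, x - half:x + half + 1]
    best_d, best_cost = 0, np.inf
    for d in range(max_disp + 1):
        xr = x - d
        if xr - half < 0:
            break
        cand = right[row - half:row + half + 1, xr - half:xr + half + 1]
        cost = sad(ref, cand)
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d

def depth_from_disparity(d, focal_len, baseline):
    # Step (3): standard triangulation for parallel-axis cameras;
    # depth is inversely proportional to disparity.
    return focal_len * baseline / d
```

For a point visible in both views, the recovered disparity d gives depth as f*B/d, which is why a longer baseline B improves depth resolution.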
A large baseline provides high resolution in the `z' dimension, but it increases the likelihood of errors due to occluding boundaries and repetitive patterns in the scene. A stereo algorithm is presented that attempts to exploit, maximally, the benefits of small and large baselines and mask sizes. In particular, a multi-baseline, coarse-to-fine approach to stereo is adopted [Okutomi and Kanade, 1993; Kanade, 1993; Farid et al., 1994], where several closely spaced views are taken (multi-baseline) and matching across these views is done for several different mask sizes (coarse-to-fine). The use of several views and mask sizes introduces a need for more sophisticated matching and combination strategies, and several such control strategies are introduced, combining information across varying mask sizes.

Intensity matching, as opposed to feature matching, is used in the stereo algorithm presented in this section. In particular, the matching correlation error is given as the sum of the absolute values of the differences of intensities over an n x n sampling window:

    \sum_{x=1}^{n} \sum_{y=1}^{n} \left| I(x, y) - \hat{I}(x, y) \right|    (1)

where I and \hat{I} are the intensities of the two windows being compared.

Consider a center image C and a sequence of images R_1 through R_3. The matching point of a point P_0 in image C can be determined in image R_1 by searching along an epi-polar line. Let the point P_1 be the matching point in image R_1. The matching point for P_1 in image R_2 can then be determined by searching about an epi-polar line centered at the projection of P_1 in image R_2. Finally, the matching point for P_2 in image R_3 can be determined by searching about an epi-polar line centered at the projection of P_2 in image R_3. The disparity for P_0 is then simply P_3x - P_0x, where P_ix is the x component of the point P_i.¹ In order to avoid errors due to occlusion, if the correlation error of a point P_i in image i is above a pre-defined threshold, then the previously matched point P_{i-1} is directly projected into image i+1. This yields a disparity map relating points in image C to those in image R_3; the process is then repeated to compute a disparity map relating points in image C to those in image L_3.

¹ This assumes parallel axis camera geometry.
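The chained epi-polar search described above might be sketched as follows, assuming evenly spaced views along a horizontal baseline (so each epi-polar line is a scanline). The per-view seed `step`, search `radius`, and window size are hypothetical parameters, not values from the paper.

```python
import numpy as np

def sad(a, b):
    # Sum-of-absolute-differences correlation error over a window.
    return np.sum(np.abs(a - b))

def track_point(images, row, x0, half=3, radius=2, step=4):
    """Follow the point at (row, x0) of images[0] through the remaining
    views: each match is found by searching a small neighborhood about
    the position predicted from the previous view's match, analogous to
    searching about an epi-polar line centered at a projection."""
    ref = images[0][row - half:row + half + 1, x0 - half:x0 + half + 1]
    x = x0
    for img in images[1:]:
        predicted = x - step  # seed from an assumed per-view disparity
        best_x, best_cost = predicted, np.inf
        for dx in range(-radius, radius + 1):
            xc = predicted + dx
            win = img[row - half:row + half + 1, xc - half:xc + half + 1]
            cost = sad(ref, win)
            if cost < best_cost:
                best_cost, best_x = cost, xc
        x = best_x
    # Magnitude of the full-chain disparity |P_3x - P_0x|; the sign
    # convention depends on the direction of camera displacement.
    return x0 - x
```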
In order to take advantage of the full baseline (image L_3 to R_3), it is necessary to "combine" the left and right disparity maps D_L and D_R. When the corresponding correlation errors C_L and C_R agree, the two half-baseline disparities are summed; otherwise, twice the disparity on the more reliable side is used:

    D_F = (D_L + D_R)        if C_L and C_R agree,
    else if (C_L < C_R) then D_F = 2 D_L,
    else D_F = 2 D_R.

Multiple depth maps can then be combined to create a larger model that is more consistent for a wider range of views.

6 Preliminary Results

Synthetic camera models supply, at each pixel, both depth and the reflected color. Using these synthetic camera models, we were able to construct the geometry of an environment using only the depth information.
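The left/right combination rule can be written as a per-pixel operation over the disparity and correlation-error maps. In this minimal sketch the tolerance-based agreement test is an illustrative assumption:

```python
import numpy as np

def combine_disparities(D_L, D_R, C_L, C_R, tol=1e-3):
    """Fuse the two half-baseline disparity maps (image C to L_3 and
    image C to R_3) into a full-baseline map D_F, per pixel.
    `tol`, the agreement test on the correlation errors, is assumed."""
    agree = np.abs(C_L - C_R) <= tol
    # When both half-baseline matches are comparably reliable, the two
    # disparities sum to the full-baseline disparity; otherwise, twice
    # the disparity on the side with the lower correlation error is used.
    return np.where(agree, D_L + D_R,
                    np.where(C_L < C_R, 2.0 * D_L, 2.0 * D_R))
```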
Figure 5: View of the same synthetic camera scene, but the user has moved forward in the room.

Figure 8: A closer view showing the resolution of the extracted geometry.
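The merging of multiple depth maps into a larger model, mentioned above, could be sketched by back-projecting each camera's depth map into a common world frame and concatenating the resulting points. The pinhole model with the principal point at the image center, and the function names, are illustrative assumptions:

```python
import numpy as np

def depth_map_to_points(depth, focal_len, cam_to_world):
    """Back-project an (H, W) depth map into world-space 3-D points.
    `cam_to_world` is a 4x4 rigid transform for that camera."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W].astype(float)
    x = (u - W / 2.0) * depth / focal_len
    y = (v - H / 2.0) * depth / focal_len
    pts = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)
    return (pts.reshape(-1, 4) @ cam_to_world.T)[:, :3]

def merge_depth_maps(maps, focal_len, poses):
    # A larger, more view-consistent model: concatenate the world-space
    # points contributed by every camera's throw-away depth map.
    return np.concatenate([depth_map_to_points(d, focal_len, T)
                           for d, T in zip(maps, poses)], axis=0)
```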