Professional Documents
Culture Documents
A Platform For Audiovisual Telepresence Using Model - and Data-Based Wave-Field Synthesis
Convention Paper
Presented at the 125th Convention
2008 October 25 San Francisco, CA, USA
The papers at this Convention have been selected on the basis of a submitted abstract and extended précis that have been peer reviewed by at least two qualified anonymous reviewers. This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.
1. INTRODUCTION
Telepresence environments have long been investigated in research and real-world applications. By recording a situation in different perceptual modalities at a remote location and reproducing it locally, this technology enables people to feel as if they were actually present in a different place or time [10], mostly by transporting important audiovisual cues, which are complemented in many
Heinrich et al.
Audiovisual Telepresence
still is) immersion of the listener into a recorded auditory scene, something that can well be interpreted as auditory telepresence.
Establishing a balance between the visual and the auditory modality is the rationale behind the development of our experimental hardware/software telepresence platform, Omniwall. It integrates high-definition visual telepresence approaches and works towards adequate auditory telepresence.
This article introduces the Omniwall approach and reports on our work in progress, especially in the audio part. Specifically, Section 2 introduces the scenarios considered and the general approaches taken to implement telepresence in them, while the rest of the paper focuses on the audio subsystems developed. Section 3 reviews the model-based audio acquisition and corresponding wave-field synthesis methods implemented in a hardware prototype, including some empirical results. With "model-based", we refer to systems that use geometric information and separate source signals to render a wave field according to a wave-propagation model. As a complement to this, in Section 4 we present a design for a data-based wave-field analysis and synthesis system, which we verify with simulation experiments. Finally, in Section 5, we draw general conclusions and discuss future work.
2.
Planar scenario: This corresponds to typical office conferencing systems where a screen or larger wall provides a window to the remote communication party or parties. A special, more telepresence-oriented case of such a setup is a shared reality where the remote space is projected locally as an extension of the local space, e.g., with a table at which remote and local conferencing partners virtually sit next to each other.
To address these two scenarios and corresponding application domains, we have developed one software system with two different hardware setups: a linear and a circular (or cylindrical) one. In the following, we briefly describe that environment and the utilised technologies and approaches; the wave-field-based audio approach then follows in detail in Sections 3 and 4.
2.1. Circular Omniwall
The circular telepresence system is based on 360° (panoramic) audiovisual acquisition and rendering. Many approaches exist in the literature, covering aspects like tiled projection on different geometries, camera-cluster-based video recording, or the integration with wave-field analysis/synthesis, e.g., [13, 1, 9]. For the panoramic scenario, we chose a radially symmetric setup for both the video and the audio part: a camera ring and a corresponding 360-degree projection canvas constitute the visual reproduction system, and the microphones and loudspeakers are aligned in the same way (see Figs. 1(a) and 1(b)).
2.2. Linear Omniwall
Although the circular telepresence setup potentially allows a high degree of audiovisual immersion, its space requirements are excessive for many real-world applications. Therefore, a specific version of the Omniwall has been developed that relies on the shared-space concept outlined above, i.e., the extension of one conferencing room by a projection of the remote space, which can in principle be created symmetrically. The setup of the linear Omniwall is presented in Figs. 2(a) and 2(b). The basic principle we follow for sound recording is based
3.1.
is done, leading to a line as the boundary of the enclosure, which can be placed at the ear height of the audience. As a result, the distance attenuation of the reproduced wave field differs from that of a field rendered by a three-dimensional setup and can be adjusted.
For our telepresence approach, we implemented a model-based WFS system using the simplifications discussed above. Using Verheijen's derivation of model-based WFS via the Rayleigh II integral [21], from each primary source p (signal modelled) to each secondary source q
[2]. Similar to their system, we utilise visual information (face localisation) to guide audio acquisition.
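As a rough illustration of the model-based principle (a minimal sketch, not Verheijen's exact driving function), rendering a point source reduces, per primary source p and secondary source (loudspeaker) q, to a propagation delay plus a distance-dependent gain. Geometry and constants below are illustrative assumptions:

```python
import numpy as np

C_SOUND = 343.0  # speed of sound (m/s)

def wfs_delay_gain(primary_pos, speaker_positions):
    """Per-loudspeaker delays (s) and gains for one primary source.

    Gains follow the 1/sqrt(r) attenuation typical of 2.5D WFS operators;
    normalisation constants are omitted in this sketch.
    """
    d = np.linalg.norm(speaker_positions - primary_pos, axis=1)  # distances p -> q
    delays = d / C_SOUND          # propagation delay per secondary source
    gains = 1.0 / np.sqrt(d)      # 2.5D distance attenuation (unnormalised)
    return delays, gains

# example: a primary source 2 m behind a linear array of 8 loudspeakers
speakers = np.stack([np.linspace(-1.4, 1.4, 8), np.zeros(8)], axis=1)
delays, gains = wfs_delay_gain(np.array([0.0, -2.0]), speakers)
```

Each loudspeaker then plays the (beamformed) source signal delayed by `delays[q]` and scaled by `gains[q]`; the superposition of these contributions approximates the primary source's wave front.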
3.4. Beamforming-based acquisition
For audio acquisition in the linear Omniwall, two standard approaches [14, 24] to microphone beamforming are employed to spatially separate the recordings of multiple conference attendees in our scenario. We are able to run the system in three different operation modes:
Standard Least-Squares: The required spatiotemporal (target) response of the array is approximated by the FIR filter coefficients using a standard linear least-squares optimisation.
Steerable Least-Squares: The filter coefficients are transformed once to a new functional basis (using Legendre polynomials), which allows for simple steering, and are transformed back on demand.
Coefficient look-up: The coefficients are computed offline in advance for a certain quantisation of angles (and distances) and are stored in memory. This method is the most efficient, since it involves no computation at all besides multichannel filtering.
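The standard least-squares mode can be sketched in a reduced, single-frequency (narrowband) form; array geometry, frequency, and target directions below are illustrative assumptions, not the paper's configuration. With as many response constraints as microphones the fit is exact, which keeps the example easy to check:

```python
import numpy as np

C = 343.0  # speed of sound (m/s)

def ls_weights(mic_x, freq, angles_deg, look_deg):
    """Solve min_w ||A w - d|| for a target spatial response d."""
    k = 2 * np.pi * freq / C
    theta = np.deg2rad(angles_deg)
    # steering matrix: one row per candidate direction, one column per mic
    A = np.exp(-1j * k * np.outer(np.cos(theta), mic_x))
    d = np.where(np.isclose(angles_deg, look_deg), 1.0, 0.0)  # 1 at look dir
    w, *_ = np.linalg.lstsq(A, d, rcond=None)
    return w, A, d

mic_x = np.linspace(-0.2, 0.2, 8)      # 8-microphone linear array (m)
angles = np.linspace(30.0, 150.0, 8)   # as many constraints as microphones
w, A, d = ls_weights(mic_x, 2000.0, angles, angles[3])
response = np.abs(A @ w)               # exact fit in this square case
```

In the broadband FIR case described above, rows for many frequencies and angles are stacked into one overdetermined system; the same least-squares call then returns the best approximation of the target response rather than an exact fit.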
The individual source signals are rendered at the remote location in the WFS-based setup using a pre-defined distance. The basic principle of integrating beamforming and WFS is motivated by prior work such as that presented in
4.1. Background
Hulsebos and colleagues have shown [12] that for circular microphone array geometries, fields incident from
arbitrary azimuths can be extrapolated in the plane using
a cylindrical harmonics decomposition and its relation to
the plane-wave decomposition. In their work, they consider paired coincident ideal monopole and dipole elements on the array, with infinitesimal separation. In later
work [8], de Vries and colleagues extended this to cardioid microphones, again with ideal assumptions.
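The circular-harmonics machinery behind these results can be sketched numerically: pressure sampled on a ring of M microphones is expanded into orders k = -K..K via a DFT over the azimuth, and with M = 2K+1 the expansion reproduces the ring samples exactly. All parameters here are illustrative, not taken from the cited work:

```python
import numpy as np

M = 33                              # number of microphones (odd)
K = (M - 1) // 2                    # maximum resolvable order, K = (M-1)/2
R, f, c = 0.75, 500.0, 343.0        # array radius (m), frequency (Hz), speed of sound
k_wave = 2 * np.pi * f / c

phi = 2 * np.pi * np.arange(M) / M  # microphone azimuths
# plane wave incident from 30 degrees, sampled at the microphone positions
p = np.exp(1j * k_wave * R * np.cos(phi - np.deg2rad(30.0)))

orders = np.arange(-K, K + 1)
# forward circular-harmonics decomposition: P_k = (1/M) sum_m p(phi_m) e^{-jk phi_m}
chd = np.array([(p * np.exp(-1j * k * phi)).sum() / M for k in orders])
# inverse expansion reproduces the ring samples exactly (finite Fourier series)
p_rec = sum(chd[i] * np.exp(1j * orders[i] * phi) for i in range(len(orders)))
```

Extrapolating the field to other radii then amounts to reweighting each order with the appropriate radial (Bessel/Hankel) functions, as in the decompositions below.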
4.2. A WFAS operator
To obtain our WFAS operator, we adapt the findings from
the mentioned literature to our requirements described
above.
P_k(\omega) = \frac{1}{M} \sum_{m=1}^{M} P(m,\omega)\, e^{-j 2\pi m k / M}   (3)

V_k(\omega) = \frac{1}{M} \sum_{m=1}^{M} V(m,\omega)\, e^{-j 2\pi m k / M}   (4)

P(m,\omega) = \sum_{k=-K}^{K} P_k(\omega)\, e^{j 2\pi m k / M}, \qquad V(m,\omega) = \sum_{k=-K}^{K} V_k(\omega)\, e^{j 2\pi m k / M}   (5)

P_k(\omega) = M_k^{(1)}(\omega) H_k^{(1)}(k R_m) + M_k^{(2)}(\omega) H_k^{(2)}(k R_m)   (6)

-\rho_0 c\, V_k(\omega) = M_k^{(1)}(\omega) H_k^{(1)\prime}(k R_m) + M_k^{(2)}(\omega) H_k^{(2)\prime}(k R_m)   (7)

where M_k^{(1,2)}(\omega) are the circular harmonic coefficients
be achieved with single signals from realistic microphones replacing the separate signals of the ideal coincident pairs.
Recent work by de Vries and colleagues [8] goes a step in that direction: an approach for an ideal cardioid characteristic is derived based on (1) the theoretical view of a cardioid as the average of a monopole and a dipole, and (2) the assumption that the half-space within the array is source-free. These two conditions combined lead to two simplifications: (1) only one signal is necessary per array position, and (2) incoming and outgoing waves can be set equal at the array radius, \{P, V\}_k^{(1)} = \{P, V\}_k^{(2)} = \{P, V\}_k, allowing the definition of an averaged CHD coefficient. Some restructuring of the above equations leads to:
M_k(\omega) = \frac{M_k^{(1)}(\omega) + M_k^{(2)}(\omega)}{2}   (8)

S_k(\omega) = \frac{P_k(\omega) + j \rho_0 c\, V_k(\omega)}{2}   (9)

= \frac{M_k(\omega)}{2} \left[ H_k^{(1)}(k R_m) + H_k^{(2)}(k R_m) - j \left( H_k^{(1)\prime}(k R_m) + H_k^{(2)\prime}(k R_m) \right) \right]   (10)

= M_k(\omega) \left( J_k(k R_m) - j J_k^{\prime}(k R_m) \right)   (11)

\bar{P}(\theta,\omega) = \sum_{k=-K}^{K} (-j)^k M_k(\omega)\, e^{j k \theta}   (13)

where K = (M - 1)/2.

P(\theta, R_q, \omega) = \frac{j k}{4 \pi^2} \int_{-\pi/2}^{\pi/2} \bar{P}(\theta + \varphi, \omega) \cos\varphi\; e^{-j k R_q (1 - \cos\varphi)}\, d\varphi.   (14)
P(\theta_q, R_q, \omega) \approx \frac{j k\, \Delta\varphi}{4 \pi^2} \sum_{n=-N}^{N} \bar{P}(\theta_q + n \Delta\varphi, \omega) \cos(n \Delta\varphi)\; e^{-j k R_q (1 - \cos(n \Delta\varphi))}   (15)
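The step from the integral in (14) to the sum in (15) is a plain Riemann-sum quadrature over 2N+1 discrete plane-wave directions. A minimal sketch of that quadrature with a placeholder integrand (not the actual driving function):

```python
import numpy as np

def angular_quadrature(integrand, N):
    """Approximate the integral of `integrand` over [-pi/2, pi/2]
    by a sum over 2N+1 equally spaced directions."""
    dphi = np.pi / (2 * N + 1)                  # angular step
    phis = np.arange(-N, N + 1) * dphi          # discrete directions
    return integrand(phis).sum() * dphi

# example: the integral of cos(phi) over [-pi/2, pi/2] equals 2
approx = angular_quadrature(np.cos, 200)
```

The quadrature error shrinks with N, so in practice N is chosen large enough that the angular sampling, rather than the quadrature itself, limits reproduction accuracy.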
4.3. Experiments
The goal of this contribution is to assess the quality of
field reproduction using the described WFAS approach.
One of the central parts of such an assessment is the introduction of realistic side conditions. This includes to
Fig. 7: Instantaneous rendered field pressure at different frequencies (x, y coordinates in cm); panel (a): r = 10 m, f = 100 Hz.
in the center of the array, with normalised error below 10% over an area of approximately twice the radius of the microphone array for frequencies around 1 kHz. Again, similar to HOA, the region of good reproduction increases with the wavelength (cf. 100 Hz in Fig. 7(a) and 3000 Hz in Fig. 7(e)), and again, from a first qualitative evaluation, the dependence on frequency seems smaller. This holds for both the far and the near source.
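The normalised-error figure quoted above can be made concrete with a small sketch: the relative L2 error between a rendered and a reference field on a grid, evaluated inside a circular region around the array centre. The fields here are synthetic stand-ins, not the simulation data:

```python
import numpy as np

def normalised_error(rendered, reference, mask):
    """Relative L2 error ||rendered - reference|| / ||reference||
    over the grid points selected by `mask`."""
    num = np.abs(rendered - reference)[mask]
    den = np.abs(reference)[mask]
    return np.linalg.norm(num) / np.linalg.norm(den)

x = np.linspace(-2, 2, 101)
X, Y = np.meshgrid(x, x)                       # evaluation grid (m)
ref = np.exp(1j * 2 * np.pi * X / 0.343)       # reference plane wave (f = 1 kHz)
rend = ref * 1.05                              # stand-in: 5 % amplitude error
inside = X**2 + Y**2 <= 1.5**2                 # circular evaluation region
err = normalised_error(rend, ref, inside)      # -> 0.05
```

A threshold on this quantity (e.g., 10 %) then directly delimits the "region of good reproduction" discussed above.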
5. CONCLUSIONS
In this paper, we have (1) presented a reproduction environment for audiovisual scenes in high resolution, showing the general setup in two scenarios. Further, we have (2) presented a study of a data-based wave-field synthesis method to analyse its viability in the telepresence setup. In contrast to the literature on data-based wave-field synthesis, where the method was
Fig. 7 (continued): (b) r = 10 m, f = 500 Hz; (c) r = 10 m, f = 1000 Hz; (d) r = 10 m, f = 1500 Hz.
ACKNOWLEDGEMENTS
REFERENCES
[10] Scott Fisher and Brenda Laurel. Be there here. InterCommunication, 1992.
[11] R. Höldrich and A. Sontacchi. Wellenfeldsynthese: Erweiterungen und Alternativen. In DAGA 2005, 31. Deutsche Jahrestagung für Akustik, München, 2005.
[12] Edo Hulsebos, Diemer de Vries, and Emmanuelle Bourdillat. Improved microphone array configurations for auralization of sound fields by wave field synthesis. J. Audio Eng. Soc., 50(10):779–790, 2002.
[13] S. Ikeda, T. Sato, and N. Yokoya. High-resolution panoramic movie generation from video streams acquired by an omnidirectional multi-camera system. In Proceedings of the IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI2003), pages 155–160, July–Aug. 2003.
[14] Lucas C. Parra. Steerable frequency-invariant beamforming for arbitrary arrays. Journal of the Acoustical Society of America, 119(6):3839–3847, 2006.
[15] Mark Poletti. A unified theory of horizontal holographic sound systems. J. Audio Eng. Soc., 48(12), December 2000.
[16] Ville Pulkki. Virtual sound source positioning using vector base amplitude panning. Journal of the Audio Engineering Society, 45(6):456–466, June 1997.
[17] R. Rabenstein and S. Spors. Spatial sound reproduction with wave field synthesis. In Atti del Congresso 2005, Audio Engineering Society, Italian Section, Como, Italia, Nov 2005.
[18] K. Shinoda, H. Mizoguchi, S. Kagami, and K. Nagashima. Visually steerable sound beam forming method possible to track target person by real-time visual face tracking and speaker array. In IEEE International Conference on Systems, Man and Cybernetics, volume 3, pages 2199–2204, Oct. 2003.
[19] S. Spors, H. Buchner, and R. Rabenstein. Eigenspace adaptive filtering for efficient pre-equalization of acoustic MIMO systems. In Proc. European Signal Processing Conference (EUSIPCO), Florence, Italy, Sep 2006.
[20] Sascha Spors. Active Listening Room Compensation for Spatial Sound Reproduction Systems. PhD
thesis, University of Erlangen-Nuremberg, 2006.
[26] Joerg Wuttke. Surround recording of music: Problems and solutions. In Proc. 119th AES Convention, 2005.
[27] Sylvain Yon, Mickael Tanter, and Mathias Fink. Sound focusing in rooms: The time-reversal approach. JASA, 113:1533–1543, 2003.