
FEATURE ARTICLE

Immersive audio
Objects, mixing, and rendering
Francis Rumsey
Consultant Technical Writer

As immersive audio systems and production techniques gain greater prominence in the market, the need for cost-effective solutions becomes apparent. Existing systems need to be adapted to enable object-based production techniques. “Ideal” reproduction solutions are having to be rationalized for practical purposes. Headphones offer one possible destination for immersive content, without excessive hardware requirements.

Any visitor to the recent 140th Convention in Paris would have been left in no doubt about the importance of the topic of immersive audio. The term immersive audio seems to be emerging as the most commonly used to describe systems and techniques that deliver spatial audio content from all around the listener, although “3D” is used almost interchangeably by some. Alongside the many demonstrations and workshops related to the theme were a number of research papers, and some of these are summarized here for the benefit of those that don’t have time to read them in depth.

STEREO TO 3D UPMIXING

In their paper, “Low-Complexity Stereo Signal Decomposition and Source Separation for Application in Stereo to 3D Upmixing” (paper 9586), Sebastian Kraft and Udo Zölzer point out that while the movie industry has adapted to multichannel content production, music is still almost all produced in two-channel stereo. Upmixing two-channel content for surround and immersive reproduction formats is therefore an attractive proposition if it can be made to deliver a convincing experience.

Kraft and Zölzer describe a signal-decomposition and source-separation approach based on mid-side (sum and difference) processing in the frequency domain, enhanced by improved ambience signal processing for stereo to 3D upmixing. The decomposed direct sound signals are repanned using VBAP (vector-base amplitude panning), while ambient sound is processed for the target channels in question using decorrelation filters. A number of assumptions enable the two-channel original to be decomposed reasonably successfully. First, it’s assumed that at any moment in time, and in any frequency band, only one dominant source will be active. Second, left and right ambience signals are assumed to sound similar but be decorrelated, and to have an amplitude that is much lower than the direct sound.

There are numerous possible immersive audio loudspeaker layouts, and the

Fig. 1. 9-channel immersive audio loudspeaker setup (Figs. 1 and 2 courtesy Kraft and Zölzer)
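The VBAP repanning step mentioned for the direct-sound signals can be illustrated with a small sketch. For a pair of loudspeakers, 2D vector-base amplitude panning solves for the gain pair whose weighted sum of loudspeaker direction vectors points at the source direction; the function and variable names here are illustrative, not taken from the paper:

```python
import numpy as np

def vbap_pair_gains(source_az, spk_az):
    """Gains for a loudspeaker pair using 2D vector-base amplitude panning.

    source_az: source azimuth in degrees; spk_az: pair of loudspeaker
    azimuths in degrees, e.g. (30.0, -30.0).
    """
    p = np.array([np.cos(np.radians(source_az)), np.sin(np.radians(source_az))])
    # Rows of L are the unit vectors pointing at the two loudspeakers.
    L = np.array([[np.cos(np.radians(a)), np.sin(np.radians(a))] for a in spk_az])
    g = p @ np.linalg.inv(L)      # solve g . L = p for the gain pair
    return g / np.linalg.norm(g)  # normalize to constant overall power

# A source midway between speakers at +/-30 degrees gets equal gains.
gains = vbap_pair_gains(0.0, (30.0, -30.0))
```

In a full upmixer this gain computation would be applied per time-frequency tile to the estimated direct-sound component, using whichever loudspeaker pair brackets the estimated source direction.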

584 J. Audio Eng. Soc., Vol. 64, No. 7/8, 2016 July/August

approach described can be adapted for any of them, but in the example tested a 9-channel version was used, which puts four additional loudspeakers in a “height” layer above a conventional 5.1 layer (see Fig. 1). The authors explain that the direct sound components are only repanned in one horizontal (2D) plane at the level of the 5.1 loudspeakers, but the ambient signals are reprocessed to create an immersive sound field involving the upper loudspeakers. A tree of decorrelation filters is used to create an even number of independent loudspeaker feeds, as shown in Fig. 2. In this case the center channel is not fed. It’s said that additional high- and low-pass crossovers can be employed to feed the layers with different frequency ranges to create the impression of height, as suggested by Lee.

Evaluating such an approach is not trivial, they say, because the extracted ambience might sound realistic but is different from the real ambient signal used in the original recording. For this reason the authors didn’t attempt a subjective evaluation, but looked at factors such as interchannel cross-correlation and level difference, comparing their system with similar approaches by Avendano, Faller, and Goodwin. They reported results that were quite close to real ambient signals, creating “a convincing immersive listening experience while preserving the sound characteristics of the original stereo track.” Reverberation is not added and the upmix is said to be downmix compatible.

Fig. 2. Decorrelation filter tree for generating eight ambience signals

DELIVERY USING JOINT OBJECT CODING

Object-based representation of spatial audio has received a lot of attention in recent years. Essentially it concerns the breaking down of audio scenes into their constituent sound “objects,” which can then be represented independently along with metadata that controls their position and other parameters. It can, however, also be adapted for other ways of representing spatial audio content, such that a loudspeaker feed from a channel-based representation could also be defined as an object routed to just one static location, for example. Such a method of representation naturally places a lot of responsibility upon the ultimate “renderer” that takes all these descriptive data and content streams and turns them into a plausible representation of the producer’s intentions on whatever reproduction system it has to deal with.

In their paper “Immersive Audio Delivery Using Joint Object Coding” (paper 9587), Heiko Purnhagen and his colleagues from Dolby Sweden describe an approach to delivering object-based representations of immersive audio content to consumer environments. In such an application it’s important to be able to deliver content over broadcast or streaming links at low bit rates. The approach they describe is used both in the AC-4 system, recently standardized by ETSI, and a backwards compatible extension of Dolby Digital Plus.

Essentially the joint object coding (JOC) method employed in this context creates a downmix of the immersive content, using a renderer that takes as its input the object audio and its metadata. Alongside this are sent parameters that enable individual audio objects to be approximately reconstructed from the downmix. The object signals are separated from the downmix using a form of time-frequency analysis that breaks each temporal frame (typically 32–43 ms) into 64 frequency bands, which are then grouped into 7–12 parameter bands for coding. The downmix is perceptually encoded. This process is broadly illustrated in Fig. 3. In the decoder the process is reversed (Fig. 4), with the downmix being extracted and the approximate objects reconstructed, being sent to a renderer that does its best to rebuild the intended spatial scene for the replay system in question using the object metadata. A decorrelator can be employed in the decoder to improve the quality of reconstruction of the audio objects.

Because the number of objects in some movie scenes can be very great, a form of reduction takes place during a preprocessing stage, which creates spatial object groups based on loudness and spatial distortion metrics (Fig. 5). The aim is to minimize the spatial error when rendered to typical consumer speaker layouts involving some 7–20 loudspeakers, where the spatial resolution is likely to be lower than it would be in a movie theater with 30–64 loudspeakers.

Fig. 3. Block diagram of an encoder implementing the joint object coding paradigm (Figs. 3–5 courtesy Purnhagen et al.)
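The grouping of 64 fine frequency bands into a handful of parameter bands, as described for JOC, can be sketched as follows. The roughly logarithmic band edges are an assumption made here for illustration; the actual grouping tables are defined by the codec and are not reproduced in this summary:

```python
import numpy as np

def group_into_parameter_bands(band_energies, n_param_bands=12):
    """Group fine frequency-band energies into coarser parameter bands.

    band_energies: per-band energies for one temporal frame (typically 64
    values in the scheme described). The log-like edge spacing below is an
    illustrative assumption, not the codec's actual table.
    """
    n = len(band_energies)
    # Approximately logarithmic edges over [0, n], deduplicated after rounding.
    edges = np.unique(np.round(np.geomspace(1, n + 1, n_param_bands + 1)).astype(int)) - 1
    # Each (a, b) pair delimits one parameter band over the fine bands.
    return np.array([band_energies[a:b].sum() for a, b in zip(edges[:-1], edges[1:])])
```

Because rounding merges the narrowest low-frequency edges, the number of resulting parameter bands comes out slightly below the requested count, which is consistent with the 7–12 range quoted above.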

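Decorrelation of the kind used both for the ambience feeds in upmixing and in the JOC decoder is commonly achieved by filtering a signal with short, noise-like impulse responses. The sketch below is a generic decorrelator of that kind, not the specific filter tree or decoder decorrelator used in the papers discussed; all parameter values are illustrative:

```python
import numpy as np

def decorrelators(n_outputs, length=1024, fs=48000, decay_ms=30.0, seed=0):
    """Build n short, exponentially decaying noise FIRs (one per output).

    Filtering one ambience signal with each FIR yields mutually
    decorrelated copies with roughly preserved energy.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(length) / fs
    env = np.exp(-t / (decay_ms / 1000.0))        # exponential decay envelope
    firs = rng.standard_normal((n_outputs, length)) * env
    # Normalize each FIR to unit energy so output level is preserved.
    firs /= np.linalg.norm(firs, axis=1, keepdims=True)
    return firs

def decorrelate(ambience, firs):
    """Convolve the single ambience channel with each FIR."""
    return np.array([np.convolve(ambience, h)[: len(ambience)] for h in firs])
```

Feeding one extracted ambience signal through such a bank produces the low cross-correlation between channels that the evaluation metrics mentioned above (interchannel cross-correlation and level difference) are designed to measure.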

Results of listening tests using a 7.1.4 loudspeaker configuration suggested that high quality can be achieved at bit rates of 256 kbit/s and higher.

Fig. 4. Block diagram of a decoder implementing the joint object coding paradigm

OBJECT-BASED RECORDING METHODS

Object-based audio structures describe the spatial layout of the scene independently of the reproduction loudspeaker format. This is particularly relevant now that there are so many possible ways of rendering immersive audio, including numerous loudspeaker options and binaural over headphones. One of the advantages of object representation is that it makes it relatively straightforward to remap content for any reproduction system, but it suits some kinds of content more than others, particularly material that has been created with multiple independent sources. It is possibly less well suited to, say, classical music content that will often be recorded using a small number of microphones that capture the entire scene, including reverberation. Paper 9570, “Object-Based Audio Recording Methods,” from the 140th Convention, by Jean-Christophe Messonnier and his colleagues, attempts to look into how conventional acoustic recording methods can be adapted to object-based representation. It considers things like whether phantom sources can be formed “between” audio objects, and indeed whether phantom sources can be turned into audio objects.

There’s also a contribution from Etienne Corteel and a group of French partners from the EDISON 3D Project, a national research project aiming to understand the factors involved in 3D audio production, and to propose tools that might help the industry adapt to the new ways of representing audio scenes. “An Open 3D Audio Production Chain Proposed by the Edison 3D Project” (paper 9589) starts by outlining the project and explaining how it is concerned with format adaptation of content, 3D temporal/spatial mixing, 3D sound rendering, and the definition of an open “pivot” format for 3D audio and associated tools.

Based on the options described in EBU Tech. 3364, which describes the metadata specification of an Audio Definition Model (ADM), the authors distinguish the three main ways in which immersive audio content can be represented in that structure. Channel-based representation assumes a conventional relationship between audio streams and loudspeaker channels, and relates to legacy formats such as 5.1. Scene-based representation assumes that the sound scene is directionally encoded using a number of eigenfunctions, such as the spherical harmonics of ambisonics. Object-based representation assumes that each stream represents a distinct sound object accompanied by metadata that describes its position and spatial extent, which may change over time.

Compatibility issues exist between the various different systems used for encoding or delivering such content to the end user, so the authors decided to use the ADM structure as an open “pivot” format in the production chain. (The ADM format has an XML tagged structure for representing the metadata.) One key reason was that ADM does not have direct proprietary relationships and has been the subject of agreement among all the players in an ITU working group. ITU-R WP6C also aims to define a baseline renderer that is expected to support ADM.

As far as the EDISON 3D Project is concerned, there are three main elements in the production chain—the digital audio workstation (DAW), so-called “Performer” software, and an audio renderer, as shown in Fig. 6. In this case the Performer is used to deal with the spatial behavior of audio objects, working in conjunction with the DAW package. The concept of object groups is introduced here, being a combination of audio streams from related objects that might represent, say, a multisource object (e.g., a choir) or a stereo pair of microphones. The group can be manipulated as a whole using a graphical interface. Object trajectories can be controlled in real time to enable live performance options. The resulting information is sent to the renderer,

Fig. 5. Illustration of object-based audio content before (left panel) and after (right panel) immersive interchange translation; the orange and blue dots in the left panel represent 19 original objects and the red circles in the right panel represent 11 spatial object groups.

Fig. 6. Elements of the EDISON 3D audio chain (Figs. 6 and 7 courtesy Corteel et al.)
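The XML-tagged structure that ADM uses for its metadata can be illustrated with a minimal fragment built with Python’s standard library. The element and attribute names follow the general shape of the audioBlockFormat element in ITU-R BS.2076, but the fragment is simplified, not schema-exact, and the ID value is hypothetical:

```python
import xml.etree.ElementTree as ET

# Illustrative ADM-style object position block. Simplified from the
# general shape of BS.2076 audioBlockFormat; the ID is hypothetical.
block = ET.Element("audioBlockFormat", {
    "audioBlockFormatID": "AB_00031001_00000001",
    "rtime": "00:00:00.00000",
    "duration": "00:00:02.00000",
})
# Position over this time interval, one element per coordinate.
for coord, value in (("azimuth", "30.0"), ("elevation", "15.0"), ("distance", "1.0")):
    ET.SubElement(block, "position", {"coordinate": coord}).text = value

xml_text = ET.tostring(block, encoding="unicode")
```

A sequence of such blocks, each with its own time range and position, is what lets an object’s trajectory change over time while remaining independent of any particular loudspeaker layout.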


which attempts to build a spatial audio scene for various different output formats. In this case the two reference renderers are the Sonic Emotion WFS system for loudspeaker arrays and the BinauralWave Max MSP suite for headphones (although the latter can also handle VBAP rendering for multispeaker layouts). Communication between the system elements uses OSC (Open Sound Control), which means that alternative renderers could also be employed, say the authors.

Because ADM is not yet implemented in commercial DAWs, a freely available library known as bbcat from BBC Research was used (see http://www.bbc.co.uk/rd/publications/audio-definition-model-software). This enables ADM metadata plus associated audio to be encapsulated into a Broadcast Wave File (BWF) container in such a way that the main BWF contents can still be read by a conventional DAW. The ADM metadata is stored in the aXML header of the file, as shown in Fig. 7.

Fig. 7. Architecture of the EDISON 3D ADM/BWF export system

HEADPHONE VIRTUALIZATION

Considering that a large amount of entertainment material is enjoyed on headphones these days, the importance of headphone rendering cannot be overstated. Grant Davidson and his colleagues describe a novel approach to this in “Design and Subjective Evaluation of a Perceptually-Optimized Headphone Virtualizer” (paper 9588). Here the aim is to maximize perceived externalization by simulating reproduction in a virtual room, while maintaining a natural timbral balance.

The authors explain that listeners often prefer conventional loudspeaker stereo over various forms of spatial enhancement for headphone presentation, largely because of the timbral changes introduced by HRTF processing, but also partly because of problems with perceived spatial width. The goal was therefore an echoic headphone virtualizer that would deliver higher listener preferences than one using conventional amplitude panning techniques. Externalization was considered important, but only to the extent that any side effects did not outweigh the benefits. In these tests immersive audio content from Dolby Atmos printmasters was rendered to a 7.1.4 loudspeaker format before being virtualized for headphones. As a reference for conventional stereo, the 7.1.4 material was downmixed using standard ITU downmix coefficients. The aim was to exceed 70% listener preference for the new headphone version compared with the stereo downmix, and for no test of the headphone virtualizer to perform worse than stereo.

A stochastic room model was employed that did not have to obey the constraints of real rooms, but enabled the capture of only the most perceptually relevant binaural room impulse response (BRIR) features. The model delivered direct sound, plus early reflections and a late reverberant tail made up of individual reflections with specific directionality, tailored to enhance the sense of externalization. By trying different lengths of BRIR it was found that the sound moved out of the listener’s head with more than 10 ms of tail, and that the effect stopped increasing after 30–70 ms, depending on the nature of the sound source and the size of the room. For this reason they set the reverb duration in the system to only 80 ms, which delivered the sense of externalization without noticeable echoes. It was also found that strong azimuthal fluctuations of impulse response segments occurring 15–50 ms after the first arrival contributed to higher externalization. Numerical optimization was used to select BRIRs that delivered the lowest spectral distortion, and results were tuned using a small listening panel.

In listening tests it was found that the new system was preferred over stereo by between 60 and 95% of listeners, with the average overall preference being 75%. One item showed a 50/50 split, but others showed a majority preferring the new system. The precise reasons for these preferences were not explored in these experiments, but the authors suggested that further tests could examine specific audio attributes in addition to preference.

LIVE SPATIAL SOUND MIXING

An approach to live sound mixing combining object-based mixing with WFS rendering is discussed by Etienne Corteel, Raphaël Foulon, and Frédéric Changenet in their paper “A Hybrid Approach to Live Spatial Sound Mixing” (paper 9527). The authors say that WFS is the only sound rendering technique that can deliver effective spatial rendering over the large listening areas involved in live sound events, but point to the challenge arising from the large number of loudspeakers needed by the technology. In the application described here they combine standard stereo techniques with spatial mixing based on WFS.

It is common to use a number of large line arrays with fill loudspeakers in live sound systems, in order to cover the audience area with a more or less even SPL distribution. Because coverage is the primary aim in such applications, relatively little attention is paid to spatial positioning of reproduced sources. It’s suggested that spatial sound reinforcement based on WFS enables naturally enhanced intelligibility, because of the increased ability of listeners to distinguish sources spatially from one another and the resulting reduction in masking. This in turn can lead to reduced use of compression and EQ, resulting in a more natural sound quality. The perception


of individual loudspeakers as sources is much reduced compared with conventional sound reinforcement systems.

Typical WFS-based installations described in the paper seem to use a rather limited number of loudspeakers in two layers, consisting of up to 16 compact loudspeakers in a front-fill lower layer plus three to seven in the upper layer having enhanced vertical directivity. These can be complemented by an array of surround speakers. With such a limited number of loudspeakers one must assume that there could only be limited successful WFS reproduction at low frequencies, and the remainder of the paper seems to imply that conventional stereo mixing techniques can be superimposed on the same loudspeakers, either as well as or instead of WFS object-based mixing. Virtual loudspeakers can be created using WFS for channel-based surround formats.

COST EFFECTIVE BASS MANAGEMENT

Immersive audio systems offer an enhanced listening experience in movie theaters and consumer environments, but they come with a cost and complexity that can be off-putting. Toni Hirvonen and Charles Robinson discuss experiments they conducted to assess the benefits of bass management techniques that enable the use of cost-effective loudspeakers in such installations. Their paper “Extended Bass Management Methods for Cost-Efficient Immersive Audio Reproduction in Digital Cinema” (paper 9595) looks into how the costs of installation can be kept manageable by using smaller surround loudspeakers.

In typical Dolby Atmos installations the surround loudspeakers are bass managed, with content below 100 Hz being routed to separate subwoofers, which may be some distance from the main speakers in question. In previous experiments the authors had found that, provided the crossover frequency was kept below 100 Hz, the resulting spatial displacement of low-frequency content was acceptable. In the current paper they looked into whether the crossover frequency could be made higher in order to enable the use of more cost-effective main surround speakers, considering frequencies of around 150 Hz, or even 300 and 500 Hz for more speculative solutions using nontraditional transducers.

In order to evaluate these ideas, tests were undertaken in a 300-seat Dolby Atmos cinema, with MUSHRA-style subjective tests being used to enable listeners to rate results. A reference configuration had the frontmost surround loudspeakers bass managed to the screen subwoofer, and the rest of them bass managed to left or right surround subwoofers (in the rear corners) as appropriate, with a crossover frequency of 80 Hz. The loudspeaker configuration is shown in Fig. 8. As it wasn’t possible to install additional loudspeakers during these experiments, the various new bass management configurations were simulated by using existing loudspeaker elements in the room to handle some parts of the frequency range, and altering the renderer routing and EQ accordingly. In some cases the mid bass was reproduced by nearby surround loudspeakers and in other cases the ceiling loudspeakers were used. The details of the experiments and the results are much more involved than can be summarized here, and interested readers should refer to the original paper.

Fig. 8. Dolby Atmos loudspeaker layout as seen from above in the cinema room used for listening tests on bass management (not to scale). The five screen loudspeakers are at the top. Different loudspeaker zones are indicated with different colors. The two middlemost arrays (dark violet and blue) are installed in the ceiling. Squares indicate the “full-range” screen loudspeakers and the “reduced low-frequency” surround loudspeakers. Diamonds indicate subwoofers. (Courtesy Hirvonen and Robinson)

Results seemed to suggest that using the ceiling woofers for the 80–150 Hz range had no benefit compared with the reference arrangement, as the ceiling loudspeakers seemed to give rise to bass build-up effects. When looking at higher crossover frequencies, using two mid-bass elements in the ceiling with a 300-Hz turnover frequency seemed subjectively acceptable. The results seemed to point to the need to consider loudspeaker setups where low, mid, and high-frequency elements might be separately located, rather than the conventional idea of integrated full-range units. Overall there seemed to be mileage in looking further into varying the spatial resolution employed in different frequency bands.

SUMMARY

As immersive audio systems and production techniques gain greater prominence in the market, the need for cost-effective solutions becomes apparent. Existing systems need to be adapted to enable object-based production techniques, and production systems need to be integrated with immersive audio renderers and file exchange standards such as ADM. “Ideal” reproduction solutions involving very large numbers of loudspeakers, originating in the laboratory or very high-end installations, are having to be rationalized quite severely for practical purposes, involving compromises whose subjective effects need to be carefully evaluated. Headphones offer one possible destination for immersive content, without excessive hardware requirements, provided that suitably convincing signal processing can be developed to render material in a way that sounds natural to listeners.

Editor’s note: the papers discussed in this article can be obtained from the AES E-Library at: http://www.aes.org/e-lib/
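As a technical footnote, the bass-management routing examined in the cinema experiments amounts to splitting each satellite channel at a crossover frequency and summing the low bands to a subwoofer feed. The sketch below uses a brickwall FFT split purely for illustration; real systems use matched crossover filters (such as fourth-order Linkwitz-Riley) at the chosen crossover frequency:

```python
import numpy as np

def bass_manage(channels, fs=48000.0, fc=100.0):
    """Split each channel at fc: highs stay on the satellite, lows sum to the sub.

    Brickwall FFT split for illustration only; practical bass management
    uses proper crossover filters rather than an ideal brickwall.
    """
    n = len(next(iter(channels.values())))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    low_mask = freqs < fc
    sats, sub = {}, np.zeros(n)
    for name, x in channels.items():
        X = np.fft.rfft(x)
        sub += np.fft.irfft(X * low_mask, n)         # low band routed to subwoofer
        sats[name] = np.fft.irfft(X * ~low_mask, n)  # high band stays on satellite
    return sats, sub
```

Raising `fc` from 80 Hz toward the 150–500 Hz values investigated in the paper shifts more signal into the shared subwoofer feed, which is exactly what makes the spatial displacement of low-frequency content an audible concern.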
