
Audio Engineering Society

Convention e-Brief 649


Presented at the 151st Convention
2021 October, Online

This Engineering Brief was selected on the basis of a submitted synopsis. The author is solely responsible for its
presentation, and the AES takes no responsibility for its contents. All rights reserved. Reproduction of this paper, or any
portion thereof, is not permitted without direct permission from the Audio Engineering Society.

MPEG-H Audio Production Workflows for a Next Generation Audio Experience in Broadcast, Streaming, and Music

Yannik Grewe¹, Philipp Eibl¹, Daniela Rieger¹, Matteo Torcoli¹, Christian Simon¹, and Ulli Scuda¹

¹ Fraunhofer Institute for Integrated Circuits IIS, Am Wolfsmantel 33, 91058 Erlangen, Germany

Correspondence should be addressed to Yannik Grewe (yannik.grewe@iis.fraunhofer.de)

ABSTRACT
MPEG-H Audio is a Next Generation Audio (NGA) system offering a new audio experience for various applications: object-based immersive sound delivers a new degree of realism and artistic freedom for immersive music applications, such as the 360 Reality Audio music service. Advanced interactivity options enable improved personalization and accessibility. Solutions exist to create object-based features from legacy material, e.g., deep-learning-based dialogue enhancement. "Universal delivery" allows for optimal rendering of a production on all kinds of devices and over various distribution paths, like broadcast or streaming. All these new features are achieved by adding metadata to the audio, which is defined during production and offers content providers flexible control of interaction and rendering options. Thus, new possibilities are introduced, but new requirements are also imposed on the production process. This paper provides an overview of production scenarios using MPEG-H Audio, along with examples of state-of-the-art NGA production workflows. Special attention is given to immersive music and broadcast applications, as well as accessibility features.

1 Introduction

MPEG-H Audio is a Next Generation Audio (NGA) system based on the open international standard ISO/IEC 23008-3, MPEG-H 3D Audio [1]. It supports sound all around as well as above and below the listeners, enabling an immersive audio experience via broadcast, streaming, and music services.

In MPEG-H Audio, scenes can consist of object-based (OBA), channel-based, and scene-based content, or any combination of them [2, 3]. Even though a three-dimensional auditory scene generally contains a larger amount of audio data than a legacy surround mix, this is mostly compensated for by the high bitrate efficiency of the MPEG-H Audio codec. Typical 3D Audio setups such as 5.1+4H or 7.1+4H can be broadcast using bit rates ranging from 192 kbit/s to 384 kbit/s, which are average values for current 5.1 transmissions [3].

With its unique interactivity features, MPEG-H Audio offers listeners the flexibility to actively engage with the content and adapt it to their own preferences by selecting pre-defined presets, e.g., "Default Mix", "Dialogue Enhanced Audio", and "Venue Sound."
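As a purely illustrative sketch (the class and function names below are ours, not part of the MPEG-H standard or any of its APIs), preset-based interactivity on a receiver could be pictured like this:

```python
from dataclasses import dataclass

@dataclass
class Preset:
    """One pre-defined presentation of the audio scene (illustrative)."""
    label: str               # text shown on the device UI
    dialogue_gain_db: float  # relative dialogue level for this preset

# Hypothetical presets mirroring the examples named in the text.
PRESETS = [
    Preset("Default Mix", 0.0),
    Preset("Dialogue Enhanced Audio", 6.0),
    Preset("Venue Sound", -100.0),  # dialogue effectively muted
]

def select_preset(label: str) -> Preset:
    """Return the preset picked by the user, falling back to the default."""
    for preset in PRESETS:
        if preset.label == label:
            return preset
    return PRESETS[0]

print(select_preset("Dialogue Enhanced Audio"))
```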
"Universal Delivery" enables a single MPEG-H stream to deliver content via streaming or broadcast platforms to all kinds of receivers and playback devices, from headphones to sound bars and discrete loudspeaker systems. The built-in renderer adapts the content to the playback capabilities of the end-consumer device.

The adoption of immersive and interactive audio introduces new possibilities, but also poses new requirements for production, distribution, and reproduction. Efficient NGA production workflows need to be defined without compromising well-established production infrastructure.
Production tools were developed to offer efficient workflows for live and post-production [4]. During NGA production trials, those workflows were evaluated with a focus on immersive music services, interactive broadcasts, as well as solutions to add advanced accessibility features based on legacy stereo mixes. The workflows for these productions are described in this paper.

2 Recording and Mixing


NGA overcomes the limitations of its legacy predecessors. Nevertheless, the main principles of its audio production workflows are similar. Independent of the production scenario, be it broadcast or music, MPEG-H Audio can be produced in a traditional fashion. However, 3D Audio signals and audio objects might be created, processed, or extracted (see Sec. 3.6). During major broadcast and streaming events, like the Rock in Rio music festival [5], the European Athletics Championships (EAC) [6], the 2020 Youth Olympic Games, or the Eurovision Song Contest (ESC) [7], it has been shown that traditional recording principles remain valid but are enhanced by capturing techniques for height loudspeaker layer reproduction. Several microphone techniques to capture three-dimensional auditory images have already been introduced and compared [8, 9]. During ESC, the height effect was generated by adding an additional 4-channel Hamasaki Square microphone array to the existing crowd microphones (see Fig. 1) [10]. At EAC, height layer information was captured using an ORTF 3D-Flat on the arena's stand [6, 11].

Additionally, 3D Audio bus structures within Digital Audio Workstations (DAW) or broadcast mixing and monitoring paths, as well as 3D panning tools, are required. When using additional audio objects, they must be kept separate from the other elements of the mix, such as Music & Effects stems. The process of generating metadata is called "authoring", and it is the most important difference between the production of MPEG-H Audio and legacy audio creation.

After the authoring, audio and metadata are exported to an intermediate file, called the MPEG-H Audio Master. Within an MPEG-H Audio Master, metadata are always transmitted alongside the produced audio essence. This ensures the integrity of the describing control data in all parts of the production and transmission chain.

Fig. 1. Schematic audio production and authoring workflow of the ESC MPEG-H production.

3 Authoring of MPEG-H Audio Scenes

Metadata are a key element of NGA systems and serve a variety of purposes, from optimized playback on very different target devices to improved accessibility features.

3.1 Metadata structure

The MPEG-H Audio system standardizes a set of metadata to define an audio scene. The set consists of descriptive metadata, control metadata, playback-related metadata, and structural metadata [12]. Metadata inform the encoder and decoder about the logical structure of the signals and their nature.

Fig. 2. Metadata structure: components and presets.

Depending on the production scenario, the level of complexity can vary from complex audio scenes with multiple channels, objects, presets, and user interactivity (e.g., for broadcast applications), to simple object-only productions for immersive music services, like 360 Reality Audio.

In Fig. 2, 13 signals are used to create an MPEG-H Audio scene. Ten of them are grouped into a channel-based component in a 5.1+4H layout. Signals 11 and 12 are objects combined in a 'Switch Group' component, which gives them a mutually exclusive characteristic, meaning that only one language can be played back at a time. Signal 13 is an audio description track. These assignments, and the way these components are arranged in presets for easy 'single-button-push' user interactivity, are examples of structural metadata. The same elements of the scene also possess descriptive metadata, such as text labels, to identify a Component or Preset on a TV user interface. MPEG-H Audio offers the possibility to transmit multiple sets of labels in different languages.

Other metadata include flags for the language of components, or mark a preset as "Audio Description", "Hearing Impaired", or "Emergency Broadcast." These settings can trigger certain behaviors in the playback device, e.g., selecting the dialogue-enhanced presentation by default, or always switching to the viewer's preferred language if available.

While these properties are actively assigned during authoring, other datasets are created automatically by the production software. These include loudness information on Components and Presets, and Dynamic Range Control (DRC) sets, which can be triggered to optimize playback on different types of devices or in various playback environments.
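To make this structure concrete, the following sketch models the scene of Fig. 2 as plain data. It is illustrative only; the class and field names are assumptions of this example and do not reflect the normative MPEG-H metadata syntax:

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Component:
    """A group of signals rendered together (channel bed or object)."""
    name: str
    kind: str            # "channels" or "object"
    signals: List[int]   # 1-based indices of the audio signals used

@dataclass
class SwitchGroup:
    """Mutually exclusive components: only one member plays at a time."""
    name: str
    members: List[Component]

@dataclass
class Preset:
    """A 'single-button-push' presentation of the audio scene."""
    label: str
    content: List[Union[Component, SwitchGroup]]

# The 13 signals of Fig. 2: a 5.1+4H bed, two dialogue objects, one AD track.
bed = Component("Ambience bed (5.1+4H)", "channels", list(range(1, 11)))
dialogue = SwitchGroup("Dialogue", [
    Component("Dialogue (English)", "object", [11]),
    Component("Dialogue (German)", "object", [12]),
])
ad = Component("Audio Description", "object", [13])

presets = [
    Preset("Default Mix", [bed, dialogue]),
    Preset("Audio Description", [bed, dialogue, ad]),
]
```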
3.2 MPEG-H Master

After the authoring, audio and metadata are exported to an MPEG-H Audio Master. The MPEG-H Audio Master contains linear PCM audio and can be archived, edited, and sent to an encoder. The most common data formats for an MPEG-H Audio Master are MPF (MPEG-H Production Format) or ADM (Audio Definition Model), either as BWF/ADM, according to ITU-R BS.2076-2 [13], or as S-ADM, according to ITU-R BS.2125-0 [14].

An MPF file is a multi-channel wav in which the metadata are modulated into a timecode-like audio signal, typically located on the last channel of the wav file, called the "Control Track". It can be created and transmitted in real time, intercut with different MPFs ('Configuration Change', e.g., for seamless ad insertion), multiplexed with video, and passed through legacy audio transmission chains such as signal routers, AD/DA converters, and sample rate converters. Due to its robust nature, the Control Track does not require audio equipment to be switched into data mode or non-audio mode in order to pass through. The MPF contains all required metadata as well as the current set of dynamic metadata, such as object positions, and can be cut at arbitrary frames. In a future IP-based production facility, the audio and metadata would be transmitted in a container over IP, e.g., according to SMPTE ST 2110.
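As a rough illustration of this file layout, the following sketch separates the audio essence from the Control Track of an MPF, assuming the typical last-channel placement described above. Decoding the modulated metadata itself requires dedicated MPEG-H tooling and is not shown; the `soundfile` package is an assumed dependency:

```python
import soundfile as sf  # assumed dependency; any multichannel wav reader works

def split_mpf(path: str):
    """Separate audio essence and Control Track of an MPF file.

    Assumes the metadata-bearing Control Track sits on the last channel,
    the typical (not mandated) placement. Demodulating the metadata
    signal itself is proprietary to MPEG-H tooling.
    """
    audio, rate = sf.read(path)   # audio shape: (frames, channels)
    essence = audio[:, :-1]       # PCM audio essence
    control_track = audio[:, -1]  # timecode-like metadata signal
    return essence, control_track, rate
```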
3.3 Live production

Live productions pose very specific requirements for authoring tools, for instance real-time creation and editing of metadata, as well as loudness measurement and level control. At the same time, the engineer needs to be able to monitor the application of the metadata and to preview different downmixes and interactivity options. These requirements resulted in the development of Audio Monitoring and Authoring Units (AMAU), a device class tailored to this exact application [15]. An AMAU provides monitoring outputs and the audio and metadata stream, which can be fed into a contribution or emission encoder, including additional renderings of the immersive mix for legacy distribution (e.g., stereo or 5.1 surround).

3.4 Post-production

In a post-production environment, software tools for authoring offer a higher degree of flexibility than hardware-based solutions. The benefits range from the use of a DAW timeline for automation to offline exports instead of real-time playout. Several authoring tools and plug-ins are available, optimized for different applications such as immersive music or broadcast production [4].

3.5 Conversion and ingest

An MPF can also be converted to and from a BWF/ADM representation of an MPEG-H Master. The MPEG-H ADM Profile defines constraints on ITU-R BS.2076 and ITU-R BS.2125 that enable interoperability with established NGA content production and distribution systems, meaning that, e.g., Dolby Atmos BWF/ADM files can be converted to MPEG-H BWF/ADM for feature enrichment and further distribution.

3.6 Accessibility and Dialogue Enhancement

Today's broadcasting and streaming audiences are very heterogeneous. They span a broad spectrum of ages, hearing and sight levels, language skills, and cognitive functions [16]. Additionally, there are a number of different playback conditions that may affect accessibility. Difficulties in following speech because of too-loud background sounds were documented 30 years ago [17] and are still relevant [18]. Media presentation should therefore be adapted to everyone's needs and tailored to each individual playback situation. This can be realized efficiently with NGA systems.

MPEG-H Audio features the possibility of adjusting the relative audio level of the dialogue, which is particularly well received by the audience [18]. MPEG-H Audio can also deliver multiple language versions, simplified speech, and audio descriptions (AD) in one audio stream, as part of the regular broadcast. This has various advantages, such as easy and efficient delivery of multiple languages and AD with immersive sound quality, and the saving of transmission bandwidth.

Taking full advantage of these capabilities requires the use of separated elements as audio objects. New productions can take this into account, but a workflow solution must also be provided for large archives with existing content (e.g., available only as a stereo mix). For the important use case of personalizing the dialogue level, this can be achieved with deep-learning-based source separation, as done by the MPEG-H Dialog+ technology [18]. Dialog+ contains a deep neural network trained to extract dialogue from a legacy mixture. The separation network is combined with automatic remixing, so that the dialogue and the background stems are generated along with accompanying metadata, including dynamic gains. At the user end, personalization is thus also available for legacy content. Dialog+ was recently employed in a large-scale field test with German public broadcasters [18].
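The network itself is not reproduced here, but the surrounding remix logic can be sketched. In the snippet below, `separate` is only a stand-in for the trained separation model, and the background attenuation value is illustrative, not one prescribed by Dialog+:

```python
import numpy as np

def separate(mixture: np.ndarray):
    """Stand-in for the trained separation network (NOT the real model).

    A real system would run deep-learning source separation here; this
    placeholder just returns an arbitrary split so the sketch executes.
    """
    dialogue = 0.5 * mixture
    background = mixture - dialogue
    return dialogue, background

def remix(mixture: np.ndarray, background_gain_db: float = -9.0) -> np.ndarray:
    """Automatic remix: keep the dialogue, attenuate the separated background.

    The -9 dB default is illustrative only.
    """
    dialogue, background = separate(mixture)
    gain = 10.0 ** (background_gain_db / 20.0)
    return dialogue + gain * background
```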
4 Delivery

Until the final encoding and delivery process, the audio is uncompressed PCM. Over an existing SDI-based, future IP-based, or file-based infrastructure, the audio and metadata can be sent to an MPEG-H enabled A/V hardware or software encoder. The signals are efficiently compressed into an MPEG-H Audio bitstream and can be distributed over various paths, like broadcast and streaming. Fig. 3 illustrates the production and delivery path of NGA compared to legacy workflows.

Fig. 3. Legacy and NGA production and delivery.

5 Decoding and Rendering

When delivered to the end-user's device, the MPEG-H Audio bitstream is decoded and the audio signals are rendered based on the metadata and user interaction. For channel-based components, a format converter is used to generate high-quality renderings and convert the decoded channel signals to numerous output formats, for example for playback on different loudspeaker layouts, while preserving the original timbre. Object-based signals are rendered to the target reproduction loudspeaker layout by the object renderer, which maps the signals to loudspeaker feeds based on the metadata and the locations of the loudspeakers in the reproduction room.
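As an intuition aid only (this is generic constant-power panning, not the MPEG-H object renderer, which handles full 3D layouts and further metadata), mapping an object position to loudspeaker gains can be sketched as follows:

```python
import math

def angular_distance(a: float, b: float) -> float:
    """Smallest absolute angle between two azimuths in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def pan_object(azimuth: float, speakers: list) -> list:
    """Constant-power pan of one object between its two nearest speakers.

    `speakers` is a list of loudspeaker azimuths in degrees. Illustrative
    only; not the MPEG-H rendering algorithm.
    """
    order = sorted(range(len(speakers)),
                   key=lambda i: angular_distance(azimuth, speakers[i]))
    a, b = order[0], order[1]
    da = angular_distance(azimuth, speakers[a])
    db = angular_distance(azimuth, speakers[b])
    frac = da / (da + db) if (da + db) > 0.0 else 0.0
    gains = [0.0] * len(speakers)
    gains[a] = math.cos(frac * math.pi / 2.0)  # closer speaker gets more
    gains[b] = math.sin(frac * math.pi / 2.0)
    return gains

# Object at 10 degrees azimuth on a 5.0 layout (L, R, C, Ls, Rs).
print(pan_object(10.0, [30.0, -30.0, 0.0, 110.0, -110.0]))
```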

Additionally, MPEG-H Audio features binaural rendering. During the decoding/rendering process, the content loudness is normalized based on metadata. Advanced loudness handling mechanisms are employed to ensure consistent playback loudness, even in case of user interaction.
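In simplified form (ignoring DRC sets and interaction-dependent handling), the normalization step amounts to applying the difference between the device's target loudness and the loudness value carried in the metadata:

```python
def normalization_gain(content_loudness_lkfs: float,
                       target_loudness_lkfs: float = -24.0) -> float:
    """Linear gain bringing the content to the device's target loudness.

    content_loudness_lkfs comes from the stream metadata; the target is
    device-dependent (the default here is illustrative).
    """
    gain_db = target_loudness_lkfs - content_loudness_lkfs
    return 10.0 ** (gain_db / 20.0)

# Example: content measured at -18 LKFS on a device targeting -24 LKFS.
print(normalization_gain(-18.0))  # ~0.5, i.e., -6 dB
```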
6 Conclusions

MPEG-H Audio provides a set of new features, such as immersive sound, personalized audio, and universal delivery, which have already been used in numerous production scenarios, for instance the Eurovision Song Contest 2019, the Youth Olympic Games in 2020, or the 360 Reality Audio music service. These new features are based on accompanying metadata and create the need for new production workflows. Numerous NGA production aspects were described, focusing on the recording and mixing practices used in the creation of immersive and interactive content. In addition, the authoring step of MPEG-H was introduced, and the delivery and rendering were described.

References

[1] ISO/IEC 23008-3:2019, "High efficiency coding and media delivery in heterogeneous environments – Part 3: 3D audio", incl. AMD 1:2019, "Audio metadata enhancements", and AMD 2:2020, "3D Audio baseline profile, corrections and improvements."

[2] Grewe, Y., et al., "Producing Next Generation Audio using the MPEG-H TV Audio System", BEITC at NAB (2018).

[3] Herre, J., et al., "MPEG-H Audio – The New Standard for Universal Spatial/3D Audio Coding", AES 137th Conv. (2014).

[4] Eibl, P., et al., "Production Tools for the MPEG-H Audio System", AES 151st Conv. (2021).

[5] Kuwabara, H., et al., "Demonstration on Next-Generation Immersive Audio in a Live Broadcast Workflow", BEITC at NAB (2020).

[6] De Jong, F., et al., "European Athletics Championships: Lessons from a Live, HDR, HFR, UHD, and NGA Sports Event", SMPTE Annu. Tech. Conf. & Exhibit (2018).

[7] Turnwald, A., et al., "Eurovision Song Contest 2018 – immersive and interactive", 30th Tonmeistertagung (2018).

[8] Grewe, Y., Scuda, U., "Comparison of main microphone systems for 3D-Audio recordings", 29th Tonmeistertagung (2016).

[9] Lee, H., "Multichannel 3D Microphone Arrays: A Review", JAES, Vol. 69, No. 1/2, pp. 5-26 (2021).

[10] Hamasaki, K., et al., "Reproducing spatial impression with multichannel audio", AES 24th Conf. (2003).

[11] Wittek, H., Theile, G., "Development and application of a stereophonic multichannel recording technique for 3D Audio and VR", AES 143rd Conv. (2017).

[12] Bleidt, R., et al., "Development of the MPEG-H TV Audio System for ATSC 3.0", IEEE Transactions on Broadcasting, Vol. 63, No. 1, pp. 202-236 (2017).

[13] ITU-R BS.2076-2, "Audio Definition Model" (2019).

[14] ITU-R BS.2125-0, "A serial representation of the Audio Definition Model" (2019).

[15] Poers, P., "Monitoring and Authoring of 3D Immersive Next Generation Audio formats", AES 139th Conv. (2015).

[16] Simon, C., et al., "MPEG-H Audio for Improving Accessibility in Broadcasting and Streaming", arXiv:1909.11549 (2019).

[17] Mathers, C. D., "A Study of Sound Balances for the Hard of Hearing", BBC White Paper (1991).

[18] Torcoli, M., et al., "Dialog+ in Broadcasting: First Field Tests Using Deep-Learning-Based Dialogue Enhancement", Int. Broadcasting Conv. (IBC) Technical Papers (2021).

