
MPEG-4 AUDIO STANDARD

1. ng B Vit
2. Ng t
3. Bi nh Phc

11DT3
11DT3
11DT2

BROAD OUTLINE

INTRODUCTION
FUNDAMENTALS
MPEG-4 GENERAL AUDIO CODING
PERFORMANCE MEASURES
EVALUATION
CONCLUSION

INTRODUCTION

A new kind of audio coding standard
Its coding tools can be grouped into two basic kinds
General audio (GA) coding
Synthetic audio coding
We focus on general audio coding
GA coding: taking PCM audio streams and efficiently encoding
them for transmission and storage

FUNDAMENTALS

Psychoacoustics

Frequency masking
Threshold of hearing
Critical bands
Bark units
Temporal masking

MPEG Audio

MPEG Layers
MPEG Audio strategy
MPEG Audio compression algorithm
Bit-allocation

PSYCHOACOUSTICS

Range of human hearing: 20 Hz to 20 kHz

Voice: 500 Hz to 4 kHz

Range of hearing depends upon frequency


Fletcher-Munson equal-loudness curves
Describe relationship between perceived loudness for a given
stimulus sound volume as a function of frequency

FLETCHER-MUNSON EQUAL-LOUDNESS
CURVES

FREQUENCY MASKING

A lower tone can effectively mask a higher tone
A higher tone does not mask a lower tone well: tones can mask
lower-frequency sounds, but not as effectively as they mask
higher-frequency ones
The greater the power in the masking tone, the broader the
range of frequencies it can mask
If two tones are widely separated in frequency, little masking
occurs

THRESHOLD OF HEARING


Threshold(f) = 3.64 (f/1000)^-0.8 - 6.5 e^(-0.6 ((f/1000) - 3.3)^2) + 10^-3 (f/1000)^4  [dB]
The origin is at the frequency of 2 kHz, since Threshold(f) = 0 at f = 2 kHz
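As a quick sanity check, the quiet-threshold curve (Terhardt's approximation, as used in the MPEG psychoacoustic models) can be evaluated directly; a minimal sketch, with f in Hz and the result in dB:

```python
import math

def threshold_db(f_hz):
    # Terhardt's approximation of the threshold of hearing in quiet;
    # f is converted to kHz inside the formula.
    f = f_hz / 1000.0
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# The curve passes close to 0 dB near 2 kHz, where the ear is most
# sensitive, and rises steeply toward both ends of the audible range.
```

Frequency components below this curve need not be transmitted at all, since they are inaudible even in silence.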

FREQUENCY MASKING CURVES


CRITICAL BANDS

Human hearing range naturally divides into critical bands


Human auditory system cannot resolve sounds better than
within about one critical band when other sounds are present
Critical bandwidth corresponds to the smallest frequency
difference between two partials such that each can still be heard
separately
Critical bandwidth
Nearly constant, somewhat less than 100 Hz, for f < 500 Hz
For f >= 500 Hz, increases roughly linearly with frequency
The audio frequency range for hearing can be partitioned into about
24 critical bands (25 are typically used for coding applications)


BARK UNITS

The range of frequencies affected by masking is broader for
higher frequencies
It is useful to define a new frequency unit
In terms of this new unit, each of the masking curves has about
the same width
The new unit defined is called the Bark, named after Heinrich
Barkhausen (1881-1956)
One Bark unit corresponds to the width of one critical band, for
any masking frequency

BARK UNITS

The conversion between a frequency f and its corresponding
critical band number b, expressed in Bark units:

b = 13 arctan(0.76 f) + 3.5 arctan((f/7.5)^2)

Another formula:

b = 26.81 f / (1.96 + f) - 0.53

where f is in kHz, b is in Barks
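The conversion can be checked numerically; a small sketch of the arctangent mapping (input in Hz, converted to kHz internally):

```python
import math

def hz_to_bark(f_hz):
    # Zwicker & Terhardt's arctangent mapping from frequency to
    # critical-band number; f is converted to kHz first.
    f = f_hz / 1000.0
    return 13.0 * math.atan(0.76 * f) + 3.5 * math.atan((f / 7.5) ** 2)

# The audible range up to 20 kHz spans roughly 24-25 Barks,
# matching the ~24 critical bands mentioned earlier.
```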

BARK UNITS

The inverse equation gives the frequency in kHz corresponding
to a particular Bark value b:

f = 1.96 (b + 0.53) / (26.28 - b)

The critical bandwidth (df) for a given center frequency f can
also be approximated by

df = 25 + 75 (1 + 1.4 f^2)^0.69

where f is in kHz and df is in Hz
The idea of the Bark unit is to define a more perceptually
uniform unit of frequency, in that every critical band's width is
roughly equal in terms of Barks
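Both relations can be sketched in a few lines (f in kHz, df in Hz):

```python
def bark_to_khz(b):
    # Inverse mapping: center frequency (kHz) for Bark value b.
    return 1.96 * (b + 0.53) / (26.28 - b)

def critical_bandwidth_hz(f_khz):
    # Approximate critical bandwidth at center frequency f (kHz).
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

# Near 100 Hz the bandwidth is roughly 100 Hz; at high center
# frequencies it grows with frequency, as described earlier.
```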

TEMPORAL MASKING

Any loud tone causes the hearing receptors in the inner ear to
become saturated, and they require time to recover

It may take as much as 500 ms for us to discern a quiet
test tone after a 60 dB masking tone has been played

TEMPORAL MASKING

The closer the frequency to the masking tone, and the closer in time to when the
masking tone is stopped, the greater the likelihood that a test tone cannot be heard

TEMPORAL MASKING

The phenomenon of saturation also depends upon how long the
masking tone is applied

Solid curve: masking tone played for 200 ms
Dashed curve: masking tone played for 100 ms

TEMPORAL MASKING

Post-masking: a signal is able to mask sounds that occur
just after it ends
Pre-masking: a signal can mask sounds played just
before the stronger signal begins
Pre-masking has a shorter effective interval (2-5 ms) than
post-masking (usually 50-200 ms)

MPEG audio compression takes advantage of these
considerations, in essence constructing a large,
multidimensional lookup table
It uses this table to transmit, using few bits, frequency
components that are masked by frequency masking, temporal
masking, or both

MPEG AUDIO

MPEG Audio proceeds by first applying a filter bank to the input,
to break the input into its frequency components
In parallel, it applies a psychoacoustic model to the data, and
this model is used in a bit-allocation block
Then the number of bits allocated is used to quantize the
information from the filter bank
The overall result is that quantization provides the
compression, and bits are allocated where they are most needed
to lower the quantization noise below an audible level

MPEG LAYERS

Three downward-compatible layers of audio compression
Each offers more complexity in the psychoacoustic model
applied and correspondingly better compression for a given
level of audio quality
Layer 1 quality can be quite good, provided a comparatively high
bit-rate is available
Layer 2 has more complexity and was proposed for use in digital
audio broadcasting
Layer 3 is the most complex and was originally aimed at audio
transmission over ISDN lines
Each of the layers uses a different frequency transform

MPEG AUDIO STRATEGY

The encoder employs a bank of filters that first analyze the
frequency components of the audio signal
A psychoacoustic model estimates the masking
threshold level
In the quantization and coding stages, the encoder balances the
masking behavior and the available number of bits by discarding
inaudible frequencies and scaling quantization according to the
sound level left over, above the masking levels

MPEG AUDIO COMPRESSION ALGORITHM

[Block diagram of the MPEG audio encoder; ancillary data input is optional]

MPEG AUDIO COMPRESSION ALGORITHM


SBS: Sub-band samples


Header contains
synchronization code (twelve 1s - 111111111111)
sampling rate used
bit-rate
stereo information

Ancillary data: e.g. multilingual data and surround-sound data


SBS format: quantized scaling factor and code-words
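Pulling those header fields out of a frame can be sketched as below, assuming the MPEG-1 bit layout (12-bit sync code, then ID, layer, protection, bit-rate index, sampling-rate index, padding, private, and mode bits):

```python
def parse_header(frame_bytes):
    # Toy parser for the 32-bit MPEG-1 audio frame header.
    w = int.from_bytes(frame_bytes[:4], "big")
    if w >> 20 != 0xFFF:                   # twelve 1s: 111111111111
        raise ValueError("sync code not found")
    return {
        "layer": 4 - ((w >> 17) & 0x3),    # '11'=Layer 1 ... '01'=Layer 3
        "bitrate_index": (w >> 12) & 0xF,  # index into the bit-rate table
        "sampling_rate_index": (w >> 10) & 0x3,
        "mode": (w >> 6) & 0x3,            # stereo information
    }
```

The bit-rate and sampling-rate fields are table indices, not values; the decoder looks them up in tables defined by the standard.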

MPEG AUDIO COMPRESSION ALGORITHM

Filter bank: divides the input into 32 frequency sub-bands


Determine the scaling factor for each sub-band
Pass the scaling factor along with sub-band samples (SBS) to
the bit-allocation block
Determine the number of code bits to quantize the sub-band to
minimize the audibility of quantization noise
Psychoacoustic model
decides whether each frequency sub-band is tone-like or
noise-like
based on that decision and the scaling factor, it calculates the
masking threshold for each band and compares it with the
threshold of hearing
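One common way to make the tone-like vs noise-like decision (the approach taken in MPEG psychoacoustic model 2) is the spectral flatness measure; a sketch, where the 0.3 cut-off is an illustrative choice, not from the standard:

```python
import math

def is_tone_like(band_power):
    # Spectral flatness: geometric mean / arithmetic mean of the band's
    # power spectrum. Near 1 = flat (noise-like); near 0 = peaky (tone-like).
    gm = math.exp(sum(math.log(p) for p in band_power) / len(band_power))
    am = sum(band_power) / len(band_power)
    return gm / am < 0.3   # illustrative threshold
```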

BIT-ALLOCATION

The bit-allocation is not part of the standard, and it can therefore
be done in many possible ways
The goal is to ensure that all the quantization noise is below the
masking threshold
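A minimal greedy allocator illustrating that goal: give the next bit to the sub-band whose quantization noise is currently furthest above its masking threshold, assuming each extra bit lowers the noise floor by about 6 dB (the standard quantization rule of thumb). Everything else here is an illustrative sketch, not the normative procedure.

```python
def allocate_bits(smr_db, total_bits, max_bits=15):
    # smr_db[i]: signal-to-mask ratio of sub-band i in dB.
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        # Noise-to-mask ratio with the current allocation: quantization
        # noise is still audible wherever this is positive.
        nmr = [s - 6.02 * b for s, b in zip(smr_db, bits)]
        i = max(range(len(smr_db)), key=lambda k: nmr[k])
        if bits[i] >= max_bits:
            break
        bits[i] += 1
    return bits
```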


MPEG-4 GENERAL AUDIO CODING

Built around the coder kernel provided by MPEG-2
Advanced Audio Coding (AAC)
With some additional coding tools and coder configurations
The perceptual noise substitution (PNS) tool and the long-term
prediction (LTP) tool are available to enhance the coding
performance for noise-like and very tonal signals, respectively
A special coder kernel (TwinVQ) is provided to cover extremely
low bit-rates
A flexible bit-rate-scalable coding system is defined, including a
variety of possible coder configurations


LONG TERM PREDICTION

Long-term prediction (LTP), newly introduced in MPEG-4, is an
efficient tool for reducing the redundancy of a signal between
successive coding frames
The LTP tool provides optimal coding gain for stationary harmonic
signals, as well as some gain for non-harmonic tonal signals
Compared with the rather complex MPEG-2 AAC predictor tool,
the LTP tool shows a saving of roughly one-half in both
computational complexity and memory requirements


PERCEPTUAL NOISE SUBSTITUTION (PNS)

A feature newly introduced into MPEG-4 GA
Aims at further optimizing the bit-rate efficiency of AAC
at lower bit-rates
The technique of PNS is based on the observation that one
noise-like signal sounds much like another
The additional decoder complexity associated with the PNS coding
tool is very low in terms of both computational and memory
requirements
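The PNS idea in miniature: for a band the model judges noise-like, transmit only its energy; the decoder substitutes locally generated noise scaled to that energy. A sketch under the simplifying assumption that bands are plain lists of spectral coefficients:

```python
import random

def pns_encode(band):
    # Transmit one number (mean power) instead of every coefficient.
    return sum(c * c for c in band) / len(band)

def pns_decode(energy, n, seed=0):
    # Regenerate noise locally and scale it to the transmitted energy.
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    p = sum(c * c for c in noise) / n
    g = (energy / p) ** 0.5 if p else 0.0
    return [g * c for c in noise]
```

The regenerated coefficients differ from the originals, but by construction they carry the same energy, which is what matters perceptually for noise-like bands.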


TWIN VQ

Increases coding efficiency for coding of musical signals at
very low bit-rates
The basic idea is to replace the conventional encoding of scale
factors and spectral data used in MPEG-4 AAC by an interleaved
vector quantization applied to a normalized spectrum
The input signal vector (spectral coefficients) is interleaved into
sub-vectors. These sub-vectors are quantized using vector
quantizers. TwinVQ can achieve higher coding efficiency at the
cost of always introducing a minimum amount of loss in
audio quality
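The interleaving step can be sketched directly; the nearest-neighbour quantizer and any codebook used with it are illustrative, not the standard's:

```python
def interleave(spectrum, n_sub):
    # Sub-vector j takes coefficients j, j + n_sub, j + 2*n_sub, ...
    # so every sub-vector samples the whole (normalized) spectrum
    # rather than one frequency region.
    return [spectrum[j::n_sub] for j in range(n_sub)]

def vq_index(vector, codebook):
    # Nearest-neighbour vector quantization: index of the closest codeword.
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(codebook[i], vector)))
```

For example, `interleave(list(range(6)), 2)` gives `[[0, 2, 4], [1, 3, 5]]`: each sub-vector spans the full spectrum, which spreads quantization error evenly across frequency.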


MPEG-4 SCALABLE GA CODING

Enables the transmission and decoding of a bit-stream with a
bit-rate that can be adapted to dynamically varying requirements
Offers significant advantages for transmitting content over
channels with variable channel capacity, or connections for
which the available channel capacity is unknown at the time of
encoding

MPEG-4 SCALABLE GA CODING

Base layer: transmits the most relevant components of the
signal at a basic level of quality
Enhancement layers: enhance the coding precision delivered by
the preceding layers
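A two-layer sketch of that split, with illustrative step sizes (not from the standard): the base layer carries a coarse quantization, the enhancement layer the quantized residual.

```python
def scalable_encode(samples, base_step=0.25, enh_step=0.05):
    base = [round(s / base_step) for s in samples]           # base layer
    resid = [s - q * base_step for s, q in zip(samples, base)]
    enh = [round(r / enh_step) for r in resid]               # enhancement layer
    return base, enh

def scalable_decode(base, enh=None, base_step=0.25, enh_step=0.05):
    # Decoding the base layer alone gives basic quality; adding the
    # enhancement layer refines the precision.
    out = [q * base_step for q in base]
    if enh is not None:
        out = [o + e * enh_step for o, e in zip(out, enh)]
    return out
```

If the channel narrows, the enhancement layer can simply be dropped and the stream remains decodable at the base quality.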

MPEG-4 SCALABLE AUDIO CODER

The base layer coder operates at a lower sampling frequency than
the enhancement layer coder

CONCLUSION

Based in part on MPEG-2 AAC, in part on conventional speech
coding technology, and in part on new methods, the MPEG-4
General Audio coder provides a rich set of tools and features to
deliver both enhanced coding performance and provisions for
various types of scalability. MPEG-4 GA coding defines the
current state of the art in perceptual audio coding
