
DISCRETE FOURIER TRANSFORM

FOURIER SERIES
Any periodic function can be expressed as the sum of a
series of sines and cosines (of varying amplitudes)
SQUARE WAVE
(Figure: partial sums of odd harmonics.)
Frequencies: f | f + 3f | f + 3f + 5f | f + 3f + 5f + … + 15f


SAWTOOTH WAVE
(Figure: partial sums of all harmonics.)
Frequencies: f | f + 2f | f + 2f + 3f | f + 2f + 3f + … + 8f
FOURIER SERIES
A periodic function f(x) can be expressed as a series of sines and cosines:
f(x) = a_0/2 + Σ_{n=1}^{∞} [ a_n cos(nx) + b_n sin(nx) ]
Transform means:
to change completely the appearance or character of something or
someone, especially so that the thing or person is improved.

In mathematics, a transformation is an invertible function f (usually with some
geometrical underpinning) that maps a set X to itself, or from X to another set Y.
FOURIER TRANSFORM
Fourier Series can be generalized to complex numbers, and further generalized to
derive the Fourier Transform:
Forward Fourier Transform:
F(ω) = ∫_{−∞}^{∞} f(t) e^{−iωt} dt

Inverse Fourier Transform:
f(t) = (1/2π) ∫_{−∞}^{∞} F(ω) e^{iωt} dω

FOURIER TRANSFORM
The Fourier Transform maps a time series (e.g. audio samples) into the series of
frequencies (their amplitudes and phases) that compose the time series.

The Inverse Fourier Transform maps the series of frequencies (their amplitudes and
phases) back into the corresponding time series.

The two functions are inverses of each other.


DISCRETE FOURIER TRANSFORM
If we wish to find the frequency spectrum of a function that
we have sampled, the continuous Fourier Transform is not
so useful.
We need a discrete version:
the Discrete Fourier Transform (DFT).
DISCRETE FOURIER TRANSFORM
Forward DFT:
F_k = Σ_{n=0}^{N−1} f_n e^{−2πi kn/N}
The complex numbers f_0 … f_{N−1} are transformed
into the complex numbers F_0 … F_{N−1}.

Inverse DFT:
f_n = (1/N) Σ_{k=0}^{N−1} F_k e^{2πi kn/N}
The complex numbers F_0 … F_{N−1} are transformed
back into the complex numbers f_0 … f_{N−1}.
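As a sanity check on these definitions, here is a minimal NumPy sketch of the
forward and inverse DFT computed directly from the sums above, compared
against the library FFT:

```python
import numpy as np

def dft(f):
    """Naive forward DFT, straight from the definition (O(N^2))."""
    N = len(f)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    W = np.exp(-2j * np.pi * k * n / N)   # W[k, n] = e^{-2*pi*i*k*n/N}
    return W @ f

def idft(F):
    """Naive inverse DFT with the 1/N normalization."""
    N = len(F)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    W = np.exp(2j * np.pi * k * n / N)
    return (W @ F) / N

x = np.random.randn(8)
assert np.allclose(dft(x), np.fft.fft(x))   # matches the library FFT
assert np.allclose(idft(dft(x)), x)         # round trip recovers the input
```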
DFT EXAMPLE
• Interpreting a DFT can be slightly difficult, because the DFT of
real data includes complex numbers.
• Basically:
• The magnitude of the complex number for a DFT component is
the power at that frequency.
• The phase θ of the waveform can be determined from the
relative values of the real and imaginary coefficients.
• Also, both positive and “negative” frequencies show up.
Sampled data:
f(x) = 2 sin(x) + sin(3x)

DFT: Real Components

DFT: Imaginary Components

“Negative” Frequencies

DFT: Magnitude
Sampled data:
f(x) = 2 sin(x + 45°) + sin(3x)

DFT: Real Components

DFT: Imaginary Components

DFT: Magnitude
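These plots can be reproduced with a few lines of NumPy; a sketch, assuming
N = 64 samples over one period [0, 2π):

```python
import numpy as np

N = 64
x = 2 * np.pi * np.arange(N) / N        # one period, N samples
f = 2 * np.sin(x) + np.sin(3 * x)       # the first sampled data set above

F = np.fft.fft(f)

# For pure sines the energy sits in the imaginary components, with peaks
# at k = 1 and k = 3 and mirrored "negative" frequencies at k = N-1, N-3.
print(np.round(F.imag[:5]))        # [  0. -64.   0. -32.   0.]
print(np.round(np.abs(F)[:5]))     # [  0.  64.   0.  32.   0.]

# A phase shift (the second data set) moves energy into the real part,
# but the magnitude spectrum stays the same.
```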
FAST FOURIER TRANSFORM
• A direct Discrete Fourier Transform would normally require O(n^2) time to process n samples.

• In practice, it is usually not calculated this way.

• The Fast Fourier Transform takes O(n log n) time.
• The most common algorithm is the Cooley-Tukey algorithm.
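A minimal recursive radix-2 Cooley-Tukey sketch (assuming the input length is a
power of two): it splits the input into even- and odd-indexed samples and
combines the two half-size DFTs with twiddle factors.

```python
import numpy as np

def fft(f):
    """Recursive radix-2 Cooley-Tukey FFT; len(f) must be a power of two."""
    N = len(f)
    if N == 1:
        return np.asarray(f, dtype=complex)
    even = fft(f[0::2])                 # DFT of even-indexed samples
    odd = fft(f[1::2])                  # DFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate([even + twiddle * odd,
                           even - twiddle * odd])

x = np.random.randn(16)
assert np.allclose(fft(x), np.fft.fft(x))
```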
FOURIER COSINE TRANSFORM

Any function can be split into even and odd parts:
f(x) = f_e(x) + f_o(x), where f_e(x) = [f(x) + f(−x)]/2 and f_o(x) = [f(x) − f(−x)]/2
Even: f_e(x) = f_e(−x)   Odd: f_o(x) = −f_o(−x)

Then the Fourier Transform can be re-expressed as:
F(ω) = ∫ f_e(t) cos(ωt) dt − i ∫ f_o(t) sin(ωt) dt
(a cosine transform of the even part and a sine transform of the odd part)
• The discrete Fourier transform (DFT) transforms a complex
signal into its complex spectrum.
• However, if the signal is real, as in most applications,
half of the data is redundant:
• In the time domain, the imaginary part of the signal is all
zero;
• In the frequency domain, the real part of the spectrum is
even symmetric and the imaginary part is odd.
• This means that the real part is even and the imaginary part is odd, or
equivalently that the magnitude is even and the phase is odd. Thus, the
information contained in the negative frequencies is a duplication of the
information contained in the positive frequencies.
• In comparison, the Discrete Cosine Transform (DCT) is a real transform that
transforms a sequence of real data points into its real spectrum and therefore avoids
the problem of redundancy.
• Also, as the DCT is derived from the DFT, all the desirable properties of the DFT (such
as the fast algorithm) are preserved.
• When the input data contains only real numbers from an even function, the sine
component of the DFT is 0, and the DFT becomes a Discrete Cosine Transform
(DCT).
• There are 8 variants, of which 4 are common.
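A quick numerical check of both claims, sketched with scipy.fft (type-II DCT):

```python
import numpy as np
from scipy.fft import fft, dct

x = np.random.randn(8)   # a real signal

X = fft(x)
# Conjugate symmetry of the DFT of real data: X[k] == conj(X[N-k]), so the
# "negative" frequencies duplicate the positive ones.
assert np.allclose(X[1:], np.conj(X[1:][::-1]))

# The DCT of the same real data is purely real: no redundant half.
print(dct(x, type=2))
```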
DCT VS DFT

For compression, we work with sampled data in a finite time window.

Fourier-style transforms imply the function is periodic and extends to infinity. So which
periodic extension is best?

(Figure: periodic extensions of the sampled data. The DFT's straight periodic
extension introduces a discontinuity at the window boundary; the DCT's
even-symmetric extension has less discontinuity.)
DCT TYPES
DCT Type II

Most common variant, also called simply “the DCT”.

Used in JPEG, repeated for a 2-D transform.
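A sketch of the JPEG-style use: a separable 2-D type-II DCT applied to an
8×8 block (the block values here are made up for illustration):

```python
import numpy as np
from scipy.fft import dctn, idctn

block = np.random.randint(0, 256, (8, 8)).astype(float)  # stand-in 8x8 pixel block

# "Repeated for a 2-D transform": dctn applies the 1-D DCT-II along one
# axis and then the other (orthonormal scaling).
coeffs = dctn(block, type=2, norm='ortho')

# The type-III DCT inverts it, recovering the block exactly.
assert np.allclose(idctn(coeffs, type=2, norm='ortho'), block)
```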


DCT TYPES
DCT Type III

Correspondingly often called simply “the inverse DCT” or “the IDCT”.
DCT TYPES
DCT Type IV
Used in MP3.

In MP3, the data is overlapped so that half the data from
one sample set is reused in the next.
Known as the Modified DCT (MDCT).

This reduces boundary effects.
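A direct-form sketch of the MDCT under these conventions: 2N overlapped time
samples map to N coefficients (windowing is omitted for brevity, and the
frame sizes below are illustrative):

```python
import numpy as np

def mdct(x):
    """Direct-form MDCT: 2N time samples -> N coefficients (O(N^2) sketch)."""
    N = len(x) // 2
    n = np.arange(2 * N)
    k = np.arange(N)
    # DCT-IV-style kernel with the (n + 1/2 + N/2) time offset that gives
    # the 50% overlap its time-domain aliasing cancellation property.
    basis = np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))
    return basis @ x

# 50% overlap: consecutive 2N-sample windows advance by only N samples,
# so half of each window is reused in the next.
N = 18
signal = np.random.randn(6 * N)
frames = [mdct(signal[i:i + 2 * N]) for i in range(0, len(signal) - 2 * N + 1, N)]
```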


WHY DO WE USE DCT FOR
MULTIMEDIA?
For audio:
The human ear has a different dynamic range for different frequencies.
Transform from the time domain to the frequency domain, and quantize different
frequencies differently.
For images and video:
The human eye is less sensitive to fine detail.
Transform from the spatial domain to the frequency domain, and quantize high
frequencies more coarsely (or not at all).
This has the effect of slightly blurring the image - may not be perceptible if done
right.
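A sketch of that idea on a single 8×8 block: keep only the low-frequency
corner of the 2-D DCT and reconstruct (the cutoff is arbitrary, for
illustration):

```python
import numpy as np
from scipy.fft import dctn, idctn

block = np.random.rand(8, 8)                  # stand-in for 8x8 image pixels
coeffs = dctn(block, type=2, norm='ortho')

# "Quantize high frequencies more coarsely (or not at all)": here we simply
# zero every coefficient beyond an arbitrary low-frequency cutoff.
r, c = np.indices(coeffs.shape)
coeffs[r + c > 6] = 0

blurred = idctn(coeffs, type=2, norm='ortho')  # slightly smoothed block
```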
WHY USE DCT/DFT?

• Some tasks are much easier to handle in the frequency domain than in the
time domain.
• E.g. a graphic equalizer. We want to boost the bass:
1. Transform to frequency domain.
2. Increase the magnitude of low frequency components.
3. Transform back to time domain.
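A minimal sketch of those three steps with NumPy (the 200 Hz cutoff and the
gain of 2 are arbitrary choices for illustration):

```python
import numpy as np

def boost_bass(samples, rate, cutoff_hz=200.0, gain=2.0):
    """Graphic-equalizer sketch: scale up components below cutoff_hz."""
    spectrum = np.fft.rfft(samples)                 # 1. to frequency domain
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)
    spectrum[freqs < cutoff_hz] *= gain             # 2. boost low frequencies
    return np.fft.irfft(spectrum, n=len(samples))   # 3. back to time domain

rate = 8000
t = np.arange(rate) / rate
audio = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 1000 * t)
louder_bass = boost_bass(audio, rate)   # the 100 Hz component is now doubled
```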
(Figure: original image vs. JPEG image, showing JPEG noise from discarding
high-frequency DCT coefficients.)
MULTIMEDIA PROCESSING IN
COMMUNICATION
ADVANTAGES OF DIGITAL MEDIA
Robustness
Seamless integration
Reusability and interchangeability
Ease of distribution
DIGITAL IMAGE
Can be captured by:
 Camera
 Scanning a photograph
Digital images are composed of a collection of pixels arranged in
a 2D matrix; the dimensions of this matrix are the image resolution.
A pixel consists of R, G, B components.
The number of bits used to represent a pixel is called the color depth, which
decides the actual number of colors available to represent a pixel.
Resolution and color depth determine the presentation quality and
the size of image storage:
more pixels and more colors → better quality, but larger size.
To reduce the storage, three approaches are used:
1. Index color
 This approach reduces the storage size by using a limited number of bits with a color lookup
table to represent a pixel. Dithering can be used to create additional colors by blending
colors from the palette.
2. Color subsampling
 Humans perceive color as brightness, hue and saturation rather than as RGB components. To take
advantage of this mechanism of the human eye, light can be separated into luminance
and chrominance instead of RGB components (see the sketch after this list).
 This reduces the size by using fewer bits to represent the chrominance components while
leaving the luminance component unchanged.
3. Spatial reduction
 Also known as data compression, this reduces the size by throwing away the spatial redundancy
within images.
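A minimal sketch of the color subsampling idea (item 2): separate RGB into
luminance and chrominance, then store the chroma planes at reduced
resolution. The BT.601 weights below are a common choice, used here as an
assumption:

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Split RGB into luminance (Y) and chrominance (Cb, Cr), BT.601 weights."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance
    cb = 0.564 * (b - y)                     # blue-difference chroma
    cr = 0.713 * (r - y)                     # red-difference chroma
    return y, cb, cr

rgb = np.random.rand(16, 16, 3)
y, cb, cr = rgb_to_ycbcr(rgb)

# 4:2:0-style subsampling: keep Y at full resolution, store each chroma
# plane at half resolution in both directions (a quarter of the samples).
cb_sub, cr_sub = cb[::2, ::2], cr[::2, ::2]
```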
DIGITAL VIDEO
Video is a series of still image frames.
Anything more than 20 fps appears smooth to the Human Visual System (HVS).
Biggest challenges:
 Massive volume of data involved
 Meeting real-time constraints on retrieval, delivery and display
Solutions:
 Compromise on the presentation quality
 Video compression
DIGITAL AUDIO
It is designed to make use of the range of human hearing.
Two important considerations:
 Frequency response
 Dynamic range

Frequency response refers to the range of frequencies that a medium can
reproduce accurately. The human hearing frequency range is 20 Hz to 20 kHz.
Dynamic range describes the spectrum of the softest to the loudest sound
amplitude levels that a medium can reproduce.
 Human hearing can accommodate a dynamic range greater than a factor of a million.
 Humans perceive sound across roughly 120 dB, the upper limit of which is painful.
Sound is characterized by frequency (Hz), amplitude (dB) and phase
(degrees).
Ex:
 The sampling rate of CD quality is 44.1 kHz. Thus it can accommodate the
highest frequency of human hearing, 20 kHz.
 Telephone quality sound adopts an 8 kHz sampling rate, which can
accommodate hearing up to 4 kHz.
Attempting to record frequencies that exceed half the
sampling rate results in digital audio aliasing.
(Figure: the digital audio signal processing chain.)

• The quality of digital audio is characterized by the sampling rate, quantization
interval and the number of channels.

• A higher sampling rate, more bits per sample and more channels mean
higher audio quality, and higher storage and bandwidth requirements.
A 44.1 kHz sampling rate, 16-bit quantization and stereo (two-channel)
reception produce CD quality audio, requiring a bandwidth of
44,100 × 16 × 2 ≈ 1.4 Mb/s.
Telephone quality audio has an 8 kHz sampling rate, 8-bit
quantization and mono reception, requiring a bandwidth of
8000 × 8 × 1 = 64 kb/s.
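The same arithmetic as a tiny helper, for checking other configurations:

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM bandwidth in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

print(pcm_bitrate(44_100, 16, 2))  # 1411200 b/s, ~1.4 Mb/s (CD quality)
print(pcm_bitrate(8_000, 8, 1))    # 64000 b/s = 64 kb/s (telephone quality)
```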
Integrated media systems will only achieve their potential if they are truly integrated
in three key ways: integration of content, integration with human users and
integration with other media systems.
Integration of content: such systems must successfully combine digital video and
audio, text, animation, graphics, etc.
Integration with human users: they must integrate with the individual user through
cooperative, interactive, multidimensional, dynamic interfaces.
Integration with other media systems: integrated media systems must connect with
other such systems.
SIGNAL PROCESSING ELEMENTS
Many classical signal processing methods have been developed, but the
key driver is optimization for representing multimedia
components and their associated storage and delivery requirements.
The optimization procedures range from very simple to sophisticated,
such as:
 Nonlinear analog (video and audio) mapping
 Quantization of analog signals
 Statistical characterization
 Motion representation and models
 3D representations
 Color processing
CHALLENGES OF MULTIMEDIA INFORMATION
PROCESSING
• Novel communications and networking are critical for a multimedia database
to support an interactive dynamic interface.
• A truly integrated media system must connect with individual users and a content-
addressable multimedia database.
• Multimedia systems must successfully combine digital video and audio, text,
animation, graphics and knowledge about such information units and their
interrelationships in real time.
• The operations of filtering, sampling, spectrum analysis and signal
representation are basic to all signal processing.
• Understanding these operations in the multidimensional (mD) case is a major
activity.
• Algorithms for processing mD signals can be grouped into four
categories:
• 1. Separable algorithms that use 1D operators to process the rows
and columns of a multidimensional array
• 2. Non-separable algorithms that borrow their derivation from
their 1D counterparts
• 3. mD algorithms that are significantly different from their 1D
counterparts
• 4. mD algorithms that have no 1D counterparts
SEPARABLE ALGORITHMS
• Operate on the rows and columns of an mD signal sequentially.
• Widely used for image processing because they invariably require
less computation than non-separable algorithms.
• Examples of separable procedures include the mD DFT, DCTs, and FFT-
based spectral estimation using the periodogram.
• Separable Finite Impulse Response (FIR) filters can be used in
separable filter banks, wavelet representations for mD signals, and
decimators and interpolators for changing the sampling rate.
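A small sketch of why separability pays off: filtering an image with a
separable 2-D kernel by two 1-D passes gives the same result as one full
2-D convolution, at lower cost.

```python
import numpy as np
from scipy.signal import convolve, convolve2d

h = np.array([1.0, 2.0, 1.0]) / 4.0     # 1-D smoothing kernel
image = np.random.rand(32, 32)

# Separable form: run the 1-D operator over every row, then every column.
rows = np.apply_along_axis(lambda r: convolve(r, h, mode='same'), 1, image)
sep = np.apply_along_axis(lambda c: convolve(c, h, mode='same'), 0, rows)

# Non-separable form: one 2-D convolution with the outer-product kernel.
full = convolve2d(image, np.outer(h, h), mode='same')
assert np.allclose(sep, full)
```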
NON-SEPARABLE ALGORITHMS
• These are uniquely mD in that they cannot be decomposed into a
repetition of 1D procedures.
• They can usually be derived by repeating the corresponding 1D
derivation in an mD setting.
• Ex: upsampling and downsampling.
• There are also mD algorithms that have no 1D counterparts,
especially algorithms that perform inversion and computed imaging.
• One of these is the operation of recovering an mD distribution
from a finite set of its projections, equivalently inverting a
discretized Radon transform.
• This is the mathematical basis of computed tomography (tomography
is imaging by sections, or sectioning, through the use of any kind of penetrating wave)
and positron emission tomography.
• Another imaging method, developed first for geophysical
applications, is Fourier integration.
• Finally, signal recovery methods, unlike in the 1D case, are possible:
mD signals with finite support can be recovered from the
amplitudes of their Fourier transform or from threshold crossings.
In graphics (computer-generated images), all objects are made up of a series of
primitives, such as lines and curves, connected to each other.
FACSIMILE MACHINE
PRE AND POST PROCESSING
The hardware available to capture the data should be cheap and
affordable for a large number of users.
It is mandatory to use a preprocessing step prior to coding in
order to enhance the quality of the final pictures and to remove
the noise that will affect the performance of compression
algorithms.
Many solutions have been proposed in the field of imaging.
A more appropriate approach would be to identify the charac-
teristics of the coding scheme when designing such operators.
For example, consider the mobile camera, a widely used device:
 Such devices are usually subject to different motions, such as tilting and jitter,
translating into a global motion in the scene due to the motion of the camera.
 Here, pre- and post-processing play an important role.
It is normal to expect a certain degree of distortion of the
decoded images in very low bit-rate applications.
An additional stage could be added as a postprocessing operator
to further reduce the distortion due to compression.
There are solutions to reduce the effects of ringing,
blurring, mosquito noise, etc.
Recently, advances in postprocessing mechanisms have
improved lip synchronization in head-and-shoulder
video coding at very low bit rates by using
knowledge of the decoded audio to correct the
positions of the lips of the speaker.
SPEECH, AUDIO AND ACOUSTIC PROCESSING
Primary advances in speech and audio signal processing are:
Speech and audio signal compression
Speech synthesis
Acoustic processing and echo cancellation
Network echo cancellation
 Speech and audio signal compression
 Aims at efficient digital representation and reconstruction of speech and audio
signals for storage and playback as well as transmission.
 Various techniques for signal analysis and compression have been applied to
achieve excellent speech quality even at less than 8 kbps, which forms the basis for
cellular as well as Internet telephony.
 Speech synthesis
 Includes generation of speech from unlimited text, voice conversion and
modification of speech attributes such as time scaling and articulatory mimicry.
 Key problems are the conversion of text into a sequence of speech units, and methods to
concatenate and reconstruct the sound waveform.
Acoustic processing and echo cancellation
 Sound pickup and recording is an important area.
 In sound recording, interference (e.g., ambient noise and reverberation) degrades
the quality.
 Acoustic processing and echo cancellation include the modelling of
reverberation, design of dereverberation algorithms, echo suppression,
double-talk detection and adaptive echo cancellation.
 Network echo cancellation
 In telephony, due to the hybrid coil used for two-to-four wire conversion,
near and far echoes exist.
 Adaptive echo cancellation algorithms have been developed to address this.
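A minimal sketch of adaptive echo cancellation with a normalized LMS filter
(the delay, gain and step size below are made-up illustration values):

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, taps=32, mu=0.1):
    """Adapt an FIR estimate of the echo path and subtract it from the mic."""
    w = np.zeros(taps)                       # adaptive filter weights
    out = np.zeros(len(mic))                 # echo-cancelled output
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]        # recent far-end samples
        e = mic[n] - w @ x                   # error = mic minus estimated echo
        out[n] = e
        w += mu * e * x / (x @ x + 1e-8)     # normalized LMS update
    return out

# Hypothetical setup: the mic picks up a delayed, attenuated copy of the far end.
far = np.random.randn(4000)
mic = 0.5 * np.concatenate([np.zeros(5), far[:-5]])
clean = nlms_echo_canceller(far, mic)        # converges toward near-zero output
```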
VIDEO SIGNAL PROCESSING
• Digital video has many advantages over conventional analog video, including bandwidth
compression, robustness against channel noise, interactivity and ease of manipulation.
• Digital video signals come in many formats:
• Ex. broadcast TV signals are digitized in the ITU-R 601 format, which has: 30/25 fps, 720 pixels by 488 lines per
frame, 2:1 interlacing, 4:3 aspect ratio, and 4:2:2 chroma sampling.

The TV and PC industries have resulted in the approval of 18 different digital video formats in the
United States. Exchange of video signals between TVs and PCs requires effective format
conversion.
Several conversion methods are available, e.g. SIF (source input format), motion-adaptive field-
rate doubling and deinterlacing, and motion-compensated frame rate conversion.
Video signals suffer from several degradations and artifacts.
Some of them are objectionable for freeze-frame or printing-from-video
applications.
Some filters adapt to scene content to preserve spatial and temporal
edges while removing the noise.
Examples of edge-preserving filters are: median, weighted
median, adaptive linear mean square error and adaptive
weighted-averaging filtering.
Deblocking filters can be classified into those that:
1. Require a model of the degradation process (inverse, constrained least-squares
and Wiener filtering), and
2. Do not require a model of the degradation (contrast adjustment by histogram
specification and unsharp masking), relying instead on cues such as smooth intensity
variations, a high-resolution reference image, a reconstruction image, etc.
Another challenge is to decompose a video sequence into its elementary
parts, like synthetic or natural visual objects: finding shot boundaries, spatial
segmentation and object tracking, along with 2D and 3D representation.

Storage and archiving of digital video in shared disks and servers in large
volumes, browsing of such databases in real time and retrieval across switched
and packet networks pose many new challenges.
The simplest method to index content is by assigning
manually or semi-automatically the content to programs,
hots and visual objects.
It is of interest to browse and search for content using com-
pressed data because almost all video data will likely be
stored in compressed format.
What is video indexing?
Video indexing is the process of providing watchers a way to
access and navigate content easily, similar to book indexing:
the selection of indexes derived from the content of the video to
help organize video data and the metadata that represents the
original video stream.
A video indexing system may be frame-based, content-
based or object-based.
The basic components of a video indexing system are:
temporal segmentation: extracts shots, scenes and/or video
objects;
analysis of indexing features: computes content-based indexing
features for the extracted shots, scenes or objects;
visual summarization: produces storyboards, visual posters and
mosaic-based summaries.
CONTENT BASED IMAGE RETRIEVAL
Multimedia signal-processing methods must allow efficient access to, processing of
and retrieval of content in general, and visual content in particular.
Applications: medicine, entertainment, the consumer industry, broadcasting, journalism,
art and e-commerce.
Signal processing, pattern recognition, computer vision, database organization,
human-computer interaction and psychology must all contribute to achieving the image
retrieval goal.
Image retrieval methods face several challenges when addressing this goal.
Example:
To improve performance and address these problems, content-based image retrieval methods
have been proposed, which focus on extracting features automatically or semi-automatically.
CONTENT BASED IMAGE RETRIEVAL
• Texture-based methods
• Shape-based methods
• Color-based methods
TEXTURE BASED METHODS
Methods based on:
spatial frequencies: evaluate the coefficients of the autocorrelation function of the
texture;
co-occurrence matrices: identify repeated occurrences of gray-level pixel
configurations within the texture;
multiresolution methods: describe the texture characteristics at coarse-to-
fine resolutions.
These have been frequently employed for texture description because of their efficiency.
A major problem is sensitivity to scale: the texture characteristics may disappear
at low resolutions or may contain a significant amount of noise at high resolutions.
Example: skin-texture-based retrieval.
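A small sketch of the co-occurrence matrix idea: count how often pairs of
gray levels occur at a fixed displacement, then derive a texture statistic
(the displacement and level count are illustration choices):

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8):
    """Gray-level co-occurrence matrix for one displacement (dx, dy)."""
    m = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            m[img[y, x], img[y + dy, x + dx]] += 1
    return m / m.sum()        # normalize to joint probabilities

texture = np.random.randint(0, 8, (32, 32))
p = glcm(texture)
# A classic co-occurrence statistic: contrast (high for coarse textures).
contrast = sum(p[i, j] * (i - j) ** 2 for i in range(8) for j in range(8))
```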
SHAPE BASED METHODS
Describing the shape of an object quantitatively is a difficult task.
Several contour-based and region-based shape description methods have been
proposed.
1. Contour-based
• Chain codes, geometric border representations, Fourier transforms of the boundaries,
polygonal representations and deformable (active) models

2. Region-based
• Scalar region descriptors, moments, region decompositions and region neighborhood graphs

• The main problems associated with shape description methods are
high sensitivity to scale, difficult shape description of objects and high
subjectivity of the retrieved shape results.
COLOR BASED METHODS
Three description methods:
 Color histogram based: use a quantitative representation of
the distribution of color intensities.
 Dominant color based: use a small number of color ranges to
construct an approximate representation of the color distribution.
 Color moment based: statistical measures of the image
characteristics in terms of color.
The performance of these methods depends on: the color space,
the quantization and the distance measure used for evaluation of the
retrieved results.
A limitation of the histogram-based and dominant-color-based methods is their
inability to allow the localization of an object in an image.
The solution to this is color segmentation,
but its limitation is complexity.
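A minimal color-histogram retrieval sketch: quantize RGB into a few bins per
channel, normalize, and compare signatures with an L1 distance (the bin count
and distance choice are illustrative):

```python
import numpy as np

def color_histogram(img, bins=4):
    """Quantized RGB histogram: a global color signature for retrieval."""
    h, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins,) * 3,
                          range=((0, 256),) * 3)
    return h.ravel() / h.sum()

def l1_distance(h1, h2):
    return np.abs(h1 - h2).sum()

a = np.random.randint(0, 256, (64, 64, 3))
b = np.random.randint(0, 256, (64, 64, 3))
print(l1_distance(color_histogram(a), color_histogram(b)))  # small = similar
```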
Some or all of the limitations of these systems are the following:
• Few query types are supported
• Limited set of low-level features
• Difficult access to visual objects
• Results partially match user's expectations
• Limited interactivity with the user
• Limited system interoperability
• Scalability problems
PERCEPTUAL CODING OF DIGITAL AUDIO
SIGNALS
Perceptual audio coding is a compression technology for audio
signals that is based on the imperfections of the human ear.
It is a lossy compression technique.
The task of a perceptual audio coding technique is to produce a
decoded bitstream that sounds exactly (or at least as close as
possible) like the original audio whilst keeping the compressed
file as small as possible.
GENERAL PERCEPTUAL AUDIO CODING
ARCHITECTURE
Perceptual audio coders typically segment input signals into quasi-
stationary frames ranging from 2 to 50 ms.
The time-frequency analysis section estimates the temporal and
spectral components of each frame.
The time-frequency mapping is usually matched to the analysis
properties of the human auditory system.
It transforms the input into a set of parameters which can be
quantized and encoded according to a perceptual distortion metric.
The time-frequency analysis section may contain one of the following:
• Unitary transform
• Time-invariant bank of uniform bandpass filters
• Time-varying, critically sampled bank of nonuniform
bandpass filters
• Hybrid transform/filter bank signal analyser
• Harmonic/sinusoidal analyser
• Source-system analysis (LPC/multipulse excitation)
Time-frequency analysis involves a fundamental tradeoff between time and
frequency resolution requirements.
Perceptual distortion control is achieved by a psychoacoustic signal analysis that
estimates signal masking power based on psychoacoustic principles.

Masking is the process by which the
detection threshold of a sound (the signal)
is increased by the presence
of another sound (called the masker).

The amount of masking is defined as the
increase (in decibels) in the detection
threshold of a sound (signal) due to the
presence of a masker sound.
WHAT IS PSYCHOACOUSTICS?
Psychoacoustics is the science of sound perception, i.e.,
investigating the statistical relationships between acoustic stimuli
and hearing sensations.
The psychoacoustic model delivers masking thresholds that quantify
the maximum amount of distortion that can be injected at each point
of the time-frequency plane during quantization and encoding
without introducing audible artifacts in the reconstructed signal.
Therefore, the psychoacoustic model allows the quantization and
encoding section to exploit perceptual irrelevancies in the time-
frequency parameter set, and statistical redundancies through DPCM
and ADPCM.
Quantization might be performed on scalar or vector quantities.
After that, remaining redundancies are removed through Run-Length
Coding and entropy coding techniques such as Huffman or arithmetic
coding.
REVIEW OF PSYCHOACOUSTIC FUNDAMENTALS
Audio-coding algorithms must rely upon generalized receiver models to
optimize coding efficiency.
In the case of audio,
the receiver is ultimately the human ear, and sound perception is
affected by its masking properties.
Most current audio coders achieve compression by exploiting the fact
that irrelevant signal information is not detectable by even a well-trained
or sensitive listener.
Irrelevant information is identified during signal analysis using:
• Absolute threshold of hearing
• Critical band frequency analysis
• Simultaneous masking and the spread of masking
• Temporal masking
Combining these notions with the basic properties of signal quantization has also
led to the development of PE (Perceptual Entropy), a quantitative estimate
of the fundamental limit of transparent audio-signal
compression.
ABSOLUTE THRESHOLD OF HEARING
It is characterized by the amount of energy needed in a pure tone such that it
can be detected by a listener in a noiseless environment.
• In hearing, the absolute threshold refers to the smallest level of a tone that can be detected by
normal hearing when there are no other interfering sounds present.

The frequency dependence of this threshold was quantified with a large number of
listeners.
The quiet threshold is well approximated by the nonlinear function
T_q(f) = 3.64 (f/1000)^{−0.8} − 6.5 e^{−0.6 (f/1000 − 3.3)^2} + 10^{−3} (f/1000)^4  [dB SPL]
where SPL = Sound Pressure Level.
This is representative of a young listener with acute hearing.
T_q(f) can be interpreted as a maximum allowable energy level for coding distortions
introduced in the frequency domain.
Algorithm designers have no a priori knowledge regarding actual playback levels.

The Sound Pressure Level (SPL) curve is often referenced to
coding systems by equating the lowest point on the curve (that is,
near 4 kHz) to the energy in +/− 1 bit of signal amplitude.
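A small sketch evaluating this threshold curve over the audible range:

```python
import numpy as np

def quiet_threshold_db_spl(f_hz):
    """Absolute threshold of hearing T_q(f) in dB SPL (approximation above)."""
    f = f_hz / 1000.0
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

freqs = np.logspace(np.log10(20), np.log10(20000), 200)
tq = quiet_threshold_db_spl(freqs)
# The minimum falls near 3-4 kHz, the ear's most sensitive region.
print(freqs[np.argmin(tq)])
```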
CRITICAL BAND FREQUENCY ANALYSIS
Using the absolute threshold of hearing, it is possible to
shape the coding-distortion spectrum for perceptual coding.
Consider how the ear actually does the spectral analysis:
a frequency-to-place transformation takes place in the
inner ear, along the basilar membrane.
Distinct regions in the cochlea, each with a set of neural
receptors, are tuned to different frequency bands.
In the experimental sense, critical bandwidth can be loosely
defined as the bandwidth at which subjective responses
change abruptly.
The human ear responds to frequency content using what
are known as critical bands: narrow bandwidth
divisions of the 20 Hz - 20 kHz frequency spectrum.
If a loud frequency component exists in one of these critical
bands, it creates a masking threshold that can render quieter
frequencies in the same critical band imperceptible.
SIMULTANEOUS MASKING AND THE SPREAD OF
MASKING
Masking refers to a process where one sound is rendered
inaudible because of the presence of another sound.
Simultaneous masking refers to a frequency-domain
phenomenon that has been observed within critical bands.
It has two types:
tone-masking noise
noise-masking tone
TONE-MASKING NOISE
A tone occurring at the center of a critical band masks noise of any subcritical
bandwidth or shape, provided the noise spectrum is below a predictable
threshold directly related to the strength of the masking tone.
The presence of a strong noise or tone masker creates an excitation of
sufficient strength on the basilar membrane at the critical band location to
block transmission of a weaker signal effectively.
Interband masking has also been observed,
meaning that a masker centered within one critical band has some predictable
effect on detection thresholds in other critical bands. This effect is also known
as the spread of masking.
The spreading function is defined as
SF_dB(x) = 15.81 + 7.5 (x + 0.474) − 17.5 √(1 + (x + 0.474)^2)  [dB]
where x is the masker-maskee frequency separation in Barks.
The Bark scale is a psychoacoustical scale proposed by Eberhard Zwicker in 1961:
a frequency scale on which equal distances correspond to perceptually equal
distances. Above about 500 Hz this scale is more or less equal to a logarithmic
frequency axis; below 500 Hz the Bark scale becomes more and more linear.
The scale ranges from 1 to 24 and corresponds to the first 24 critical bands of
hearing.
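A sketch combining the spreading function with a common Hz-to-Bark
approximation (the particular arctangent formula below is one of several
published variants, used here as an assumption):

```python
import numpy as np

def hz_to_bark(f_hz):
    """One common Hz -> Bark approximation (variants exist)."""
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def spreading_db(dz):
    """Spread of masking (dB) at a masker-maskee separation of dz Barks."""
    z = dz + 0.474
    return 15.81 + 7.5 * z - 17.5 * np.sqrt(1.0 + z * z)

# The spread is asymmetric: masking reaches further toward higher
# frequencies (dz > 0) than toward lower ones (dz < 0).
print(spreading_db(np.array([-2.0, -1.0, 0.0, 1.0, 2.0])))
```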
(Figure: schematic representation of simultaneous masking.)
After critical band analysis, and with the spreading
function accounted for, masking thresholds in
psychoacoustic coders are established as:
Noise masking threshold:
TH_N = E_T − 14.5 − B
Tone masking threshold:
TH_T = E_N − K
where E_N and E_T are the critical band noise- and tone-
masker energy levels,
B = critical band number, and
K = a parameter selected between 3 and 5 dB,
depending on the algorithm. Masking thresholds are
commonly referred to as the Just Noticeable Distortion
(JND).
The Signal-to-Mask Ratio (SMR) and Noise-to-Mask Ratio
(NMR) denote the log distances from the minimum
masking threshold to the masker and noise levels,
respectively.
TEMPORAL MASKING
Temporal masking refers to masking that occurs when a
signal and a masker are not presented simultaneously.
Backward masking occurs when the masker follows the signal;
forward masking occurs when the masker precedes the signal.

Premasking tends to last only about 5 ms. Postmasking will
extend anywhere from 50 to 300 ms, depending upon the strength
and duration of the masker.
Temporal masking has been used in several audio-coding
algorithms.
Premasking has been exploited in conjunction with adaptive block-
size transform coding to compensate for pre-echo distortions.

How do we measure such perceptually relevant information?
The answer is Perceptual Entropy.
PERCEPTUAL ENTROPY
This is a measure of the perceptually relevant information contained in any audio
record. It is measured in bits per sample.
PE represents a theoretical limit on the compressibility of a particular signal.
The process flows like this:
1. The signal is first windowed and transformed to the frequency domain.
2. A masking threshold is then obtained using perceptual rules.
3. The number of bits required to quantize the spectrum is determined.
CALCULATING THE PE
Step 1: real and imaginary components are converted to power
spectral components:
P(ω) = Re^2(ω) + Im^2(ω)

Step 2: a discrete Bark spectrum is formed by summing the energy in
each critical band:
B_i = Σ_{ω=bl_i}^{bh_i} P(ω)

where the summation limits are the critical band boundaries (bl = band low, bh = band high).
The range of the index i is sampling-rate dependent, and in particular i ∈ {1, 25}.
Step 3: a basilar spreading function is then convolved with the
discrete Bark spectrum to account for interband masking:
C_i = B_i * SF_i
Step 4: an estimation of the tone-like or noise-like quality of C_i is
then obtained using the spectral flatness measure (SFM):
SFM = M_g / M_a

where M_g and M_a correspond to the geometric and arithmetic means
of the power spectral density components of each band.
The SFM has the property that it is bounded by 0 and 1.
Values close to 1 will occur if the spectrum is flat in a particular
band, indicating a decorrelated (noisy) band.
Values close to zero will occur if the spectrum in a particular band
is nearly sinusoidal.
A coefficient of tonality, α, is next derived from the SFM on a dB
scale:
α = min(SFM_dB / −60, 1)
This is used to weight the thresholding, with K = 5.5, forming an offset for
each band:
O_i = α (14.5 + i) + (1 − α) K  [dB]
A set of JND estimates in the frequency power domain is then formed
by subtracting the offsets from the Bark spectral components:
T_i = 10^{log10(C_i) − O_i/10}

These estimates are scaled by a correction factor to simulate
deconvolution of the spreading function.
Then each T_i is checked against the absolute threshold of hearing and
replaced by max(T_i, T_ABS(i));
the absolute threshold is referenced to the energy in a 4 kHz sinusoid of +/− 1 bit
amplitude.
By applying uniform quantization principles to the signal and the associated set of JND
estimates, it is possible to estimate a lower bound on the number of bits required to
achieve transparent coding.
The perceptual entropy in bits per sample is given by
PE = Σ_i Σ_{ω=bl_i}^{bh_i} [ log2(2 |int(Re(ω)/√(6 T_i / k_i))| + 1)
                             + log2(2 |int(Im(ω)/√(6 T_i / k_i))| + 1) ]
where
i = the index of the critical band,
bl_i and bh_i = the lower and upper bounds of band i,
k_i = the number of transform components in band i,
T_i = the masking threshold in band i, and
int denotes rounding to the nearest integer.
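A compact sketch of the whole recipe (Steps 1-4 plus the PE sum). It
simplifies the full method: the spreading-function convolution, the
renormalization and the absolute-threshold check are omitted, and the
Hz-to-Bark mapping is an assumed approximation.

```python
import numpy as np

def hz_to_bark(f):
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def perceptual_entropy(frame, rate):
    """Simplified perceptual entropy estimate for one frame (bits/sample)."""
    N = len(frame)
    spec = np.fft.rfft(np.hanning(N) * frame)       # window + transform
    re, im = spec.real, spec.imag
    power = re ** 2 + im ** 2                       # Step 1: power spectrum

    freqs = np.fft.rfftfreq(N, 1.0 / rate)
    band = np.minimum(hz_to_bark(freqs).astype(int), 24)   # 25 bands

    pe = 0.0
    for i in range(25):
        idx = np.where(band == i)[0]
        if len(idx) == 0:
            continue
        C = power[idx].sum()                        # Step 2 (spreading omitted)
        # Step 4: SFM-based tonality, the offset O_i, then the JND T_i.
        mg = np.exp(np.mean(np.log(power[idx] + 1e-12)))
        ma = np.mean(power[idx]) + 1e-12
        alpha = min(10.0 * np.log10(mg / ma) / -60.0, 1.0)
        offset = alpha * (14.5 + (i + 1)) + (1.0 - alpha) * 5.5
        T = 10.0 ** (np.log10(C + 1e-12) - offset / 10.0)
        # PE sum: bits to quantize Re/Im with a JND-matched step size.
        q = np.sqrt(6.0 * T / len(idx))
        pe += np.sum(np.log2(2 * np.abs(np.rint(re[idx] / q)) + 1)
                     + np.log2(2 * np.abs(np.rint(im[idx] / q)) + 1))
    return pe / N

x = np.sin(2 * np.pi * 1000 * np.arange(2048) / 44100)
print(perceptual_entropy(x, 44100))   # tonal input -> relatively low PE
```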
TRANSFORM AUDIO CODERS
