
MMA Theory Assignment

1. Critique the time and frequency domain feature pipelines with a
diagram.
Solution:
Time domain feature pipeline:
The extraction pipeline for time domain audio features involves several steps to extract
features from the raw audio signal in the time domain. Here is an overview of the typical
pipeline:
 Analog to digital: The first step in the time domain feature extraction pipeline is to
convert the audio from analog to digital format. An analog-to-digital converter (ADC)
is used for this, which means the signal is sampled and quantized so that a digital
signal is obtained.

 Framing: Once the digitized signal is obtained, the next step is framing, i.e.,
bundling together a group of samples. As in the figure below, samples 1 to 128 are
bundled together, samples 64 to 192 are bundled together, and so on. Here the
frames overlap.
Frames can be thought of as audio chunks long enough to be perceivable. One sample
at a sampling rate of 44.1 kHz (the sampling rate of CD audio) has a duration of
about 0.0227 ms, which is far below the ear's time resolution:
duration of one sample << ear's time resolution (~10 ms).
To speed up processing, a power-of-2 number of samples is used per frame. Typical
frame sizes range between 256 and 8192.
The duration of a frame is

    df = K / Sr

where Sr is the sampling rate (44.1 kHz) and K is the frame size (512).
So, df = 512 / 44100 ≈ 11.6 ms.
This duration is above the ear's time resolution.

 Feature Computation: After framing, the next step is feature computation: compute
the features on each of the frames.

 Aggregation: After feature computation, the next step is aggregation: aggregate the
per-frame features to get a single feature vector for the whole sound.
Aggregation can be done using the statistical mean, median, sum, or Gaussian
mixture models (GMMs).

As a result of aggregation, a final value is obtained, which can be a feature vector or a
feature matrix.
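The time domain pipeline above can be sketched in a few lines. The frame size, hop length, and the RMS-energy feature are illustrative choices, and the input is just random noise standing in for a digitized signal:

```python
import numpy as np

# Illustrative parameters matching the text: 44.1 kHz sampling, frame size 512.
sr = 44100
frame_size = 512            # a power of 2, as the text suggests
hop = frame_size // 2       # 50% overlap between consecutive frames

# Stand-in for one second of digitized audio (random noise, for illustration).
signal = np.random.randn(sr)

# Framing: bundle overlapping chunks of samples.
starts = range(0, len(signal) - frame_size + 1, hop)
frames = np.stack([signal[s:s + frame_size] for s in starts])

# Feature computation: e.g. RMS energy, a simple time domain feature per frame.
rms = np.sqrt(np.mean(frames ** 2, axis=1))

# Aggregation: collapse the per-frame features into one vector for the whole sound.
feature_vector = np.array([rms.mean(), np.median(rms)])

print(frame_size / sr)  # frame duration: about 0.0116 s, above the ear's ~10 ms resolution
```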
Frequency domain feature pipeline:
The extraction pipeline for frequency domain audio features involves several steps to
extract features from the raw audio signal in the frequency domain. Here is an overview
of the typical pipeline:
 Analog to digital: The first step in the frequency domain feature extraction pipeline
is to convert the audio from analog to digital format. For this we use an
analog-to-digital converter (ADC), which means we sample and quantize the
analog sound to obtain a digital signal.

 Framing: Once we have the digitized signal, the next step is framing, i.e., we
bundle together a group of samples. As we can see in the figure below, samples 1
to 128 are bundled together, samples 64 to 192 are bundled together, and so on.
Here the frames overlap.
We can think of frames as audio chunks long enough to be perceivable. One sample
at a sampling rate of 44.1 kHz (the sampling rate of CD audio) has a duration of
about 0.0227 ms, which is far below the ear's time resolution:
duration of one sample << ear's time resolution (~10 ms).
To speed up processing, we use a power-of-2 number of samples per frame. Typical
frame sizes range between 256 and 8192.
The duration of a frame is

    df = K / Sr

where Sr is the sampling rate (44.1 kHz) and K is the frame size (512).
So, df = 512 / 44100 ≈ 11.6 ms.
This duration is above the ear's time resolution.
 Time to frequency domain: After framing we move the sound from the time
domain to the frequency domain. This is done using the Fourier transform. Here we
look for the frequency components of the sound and how much each contributes to
the overall sound.

Spectral leakage happens when taking the Fourier transform of a frame whose
endpoints are discontinuous because the frame does not contain an integer number
of periods. These discontinuities appear as high-frequency components not present
in the original signal: when computing the Fourier transform, the discontinuities
leak into other, higher frequencies. Hence it is called spectral leakage.
We can resolve, or at least minimize, spectral leakage by using windowing.

Here we apply a windowing function to each frame before we feed the frame into
the Fourier transform. This attenuates the samples at the endpoints towards zero,
removing the information at the endpoints and producing a signal that looks
periodic, which minimizes spectral leakage.
After windowing, when we join the frames back together, we have lost information
from the signal, since we suppressed the samples at the endpoints.

This issue can be resolved by overlapping the frames. For overlapping, the frame
size and the hop length are important.

Fig: Non-overlapping frames

Fig: Overlapping frames

We then apply the Fourier transform to the overlapped frames.

 Feature Computation: After framing, the next step is feature computation. Here we
compute the features on each of the frames.
 Aggregation: After feature computation, the next step is aggregation. Here we
aggregate the per-frame features to get a single feature vector for the whole sound.
We can aggregate using the statistical mean, median, sum, or Gaussian mixture
models (GMMs).

As a result of aggregation, we get a final value which can be a feature vector or a feature matrix.
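The windowing and Fourier-transform steps above can be sketched as follows. The Hann window and the synthetic 1 kHz test tone are illustrative assumptions:

```python
import numpy as np

sr, frame_size, hop = 44100, 512, 256   # values from the text, 50% overlap

# A synthetic 1 kHz tone standing in for one second of digitized audio.
signal = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)

# Hann window: tapers each frame to zero at the endpoints, so the frame
# looks periodic to the Fourier transform and spectral leakage is minimized.
window = np.hanning(frame_size)

starts = range(0, len(signal) - frame_size + 1, hop)
spectra = np.stack([np.abs(np.fft.rfft(signal[s:s + frame_size] * window))
                    for s in starts])

# Each row is the magnitude spectrum of one overlapped, windowed frame.
peak_bin = spectra.mean(axis=0).argmax()
print(peak_bin * sr / frame_size)  # strongest component, close to 1000 Hz
```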
2. Define audio encoding and explain different audio encoding
formats and their evolution.

Solution:
Audio Encoding:
Audio encoding refers to the techniques capable of reducing the bit rate of an audio
signal while still preserving good perceptual quality.
The different audio encoding formats are:
 Linear PCM and Compact Discs
 MPEG Digital Audio Coding
 AAC Digital Audio Coding
 Perceptual Coding

Linear PCM and Compact Discs:

 Linear PCM is simple, but it is the most expensive technique in terms of bit rate
and the most effective in terms of perceptual quality. Since it preserves the whole
information contained in the original waveform, linear PCM is said to be
lossless.
 In general, the samples are represented with B=16 bits because this makes the
quantization error small enough to be inaudible even by trained listeners.
 The sampling frequency commonly used for high-fidelity audio is F=44.1KHz
and this leads to a bit rate of 2FB = 2*44.1KHz*16 = 1411200 bits per second.
Here 2 accounts for the two channels in a stereo recording. Such a bit rate
could be accommodated on the first supports capable of storing digital audio
signals i.e., digital audio tapes (DAT) and compact discs (CD).
 One hour of high-fidelity stereo sound at the 16-bit PCM rate requires roughly
635 MB. A CD can actually store around 750 MB, but the difference is needed
for error correction bits, i.e., data required to recover acquisition errors.
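As a quick check of the figures above:

```python
# Bit-rate and storage for CD-quality linear PCM, as derived in the text.
F = 44100      # sampling frequency in Hz
B = 16         # bits per sample
channels = 2   # stereo

bit_rate = channels * F * B            # bits per second
print(bit_rate)                        # 1411200 bits per second

# One hour of stereo sound at this rate, in bytes:
one_hour_bytes = bit_rate * 3600 // 8
print(one_hour_bytes)                  # 635040000, i.e. roughly 635 MB
```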
 Since CDs have been used mainly to replace old vinyl recordings, which were often
shorter, the one-hour limit was largely accepted by users, and still is. For this
reason, there was no pressure to decrease the PCM bit-rate in order to store
more sound on CDs.
 The perceptual improvement brought by digital rather than analog media was so
high that user expectations increased significantly, and CD quality is currently
used as a reference for any other encoding technique.
 The linear PCM is the basis for several other formats that are used in
conditions where the memory space is not a major problem: Windows WAV,
Apple AIFF and Sun AU. In fact, such formats, with different values of B and
F, are used to store sound on hard disks that are today large enough to contain
hours of recordings and that promise to grow at a rate that makes the space
constraint marginal.
 The same does not apply to telephone communications, where a high bit-rate
results in an ineffective use of the lines. For this reason, the first efforts in
reducing the bit-rate came from that domain. On the other hand, the
development of encoding techniques for phone communications has an
important advantage: since consumers are used to the fact that the so-called
telephone speech is not as natural as in other applications (e.g., radio and
television), their expectations are significantly lower and the bit-rate can be
reduced with simple modifications of the linear PCM.
 A limitation of linear PCM is that the quantization step size does not change
with the signal energy. As a result, the parameter B must be kept at a level that
leads to an acceptable SNR at low energies, but that is beyond human hearing
sensitivity at higher energies. In other words, there is a waste of bits at higher
energies.
 The A-law and µ-law logarithmic companders address this problem by
adapting the quantization error to the amplitude of the signal, and they reduce by
roughly one third the bit-rate necessary to achieve a given perceptual quality.
For this reason, the logarithmic companders are currently recommended by the
International Telecommunication Union (ITU) and are widely applied with A
= 87.55 and µ = 255.
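A sketch of the µ-law compression curve; the formula is the standard continuous µ-law characteristic, and the quantization step itself is omitted:

```python
import math

MU = 255  # the value recommended by the ITU, per the text

def mu_law_compress(x, mu=MU):
    """Map a normalized sample x in [-1, 1] through the continuous mu-law curve."""
    sign = 1 if x >= 0 else -1
    return sign * math.log1p(mu * abs(x)) / math.log1p(mu)

# Quiet samples are boosted and loud ones compressed, so a coarse uniform
# quantizer applied afterwards keeps the SNR roughly constant with energy.
print(round(mu_law_compress(0.01), 3))  # 0.228: a quiet sample is strongly boosted
print(mu_law_compress(1.0))             # 1.0: full scale maps to full scale
```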
 In case of Telephone communications, user expectations are not directed
towards the highest possible quality, but simply at keeping constant the
perceptual level in a given application. For this reason, the performance of an
encoder is measured not only with the SNR, but also with the mean opinion
score (MOS), a subjective test involving several naive listeners, i.e., people
that do not know encoding technologies.
 Each listener is asked to give a score between 1 (bad) and 5 (excellent) to a
given encoded sound and the resulting MOS value is the average of all
judgments given by the assessors. An MOS of 4.0 or more defines good
quality where the encoded signal cannot be distinguished from the original
one. An MOS between 3.5 and 4.0 is considered acceptable for telephone
communications.
 The test can be performed informally, but the results are accepted in the
official organizations only if they respect the rigorous protocols given by the
ITU.

MPEG Digital Audio Coding:


 Logarithmic companders and other approaches based on adapting the
quantization noise to the signal energy obtain significant reductions of the bit-rate,
but these are not sufficient to respect the bandwidth and space constraints imposed
by applications developed in recent years. Multimedia, streaming, online
applications, content diffusion on cellular phones, wireless transmission, etc.
require going beyond the one-third reduction achieved with the A-law and µ-law
encoding techniques.
 Also, user expectations correspond now to CD-like quality and any
degradation with respect to such a perceptual level would not be accepted. For
this reason, several efforts were made in the last decade to improve encoding
approaches.
 MPEG is the standard for multimedia; its digital audio coding technique is one
of the major results in audio coding, and it involves several major changes with
respect to linear PCM.
Below is a table for MPEG audio layers. This table reports bit-rates and
compression rates, compared to CD bit-rate, achieved at different layers in the
MPEG coding architecture. The compression rate is the ratio between CD and
MPEG bit-rate at the same audio quality level.

Layer   Bit-rate     Compression
I       384 kb/sec   4
II      192 kb/sec   8
III     128 kb/sec   12

 The first is that the MPEG architecture is organized in Layers containing sets
of algorithms of increasing complexity. Table below shows the bit-rates
achieved at each layer and the corresponding compression rates with respect to
the 16-bit linear PCM.
 The second important change is the application of an analysis and synthesis
approach implemented in layers I and II. This consists in representing the
incoming signal with a set of compact parameters, in this case sound
frequencies, which are extracted in the encoding phase and used to
reconstruct the signal in the following decoding step. An average MOS of 4.7
and 4.8 has been reported for monaural layer I and II codecs operating at 192
and 128 kbits/sec, respectively.
 The third major novelty is the application of psychoacoustic principles capable
of identifying and discarding perceptually irrelevant frequencies in the signal.
Such an approach is called perceptual coding and, since part of the original
signal is removed, the encoding approach is said to be lossy.
 The application of the psychoacoustic principles is performed at layer III and
it reduces the bit-rate of the linear PCM by a factor of 12 while achieving an
average MOS between 3.1 and 3.7.
 The MPEG layer III is commonly called mp3 and it is used extensively on the
web because of its high compression rate. In fact, the good tradeoff between
perceptual quality and size makes the mp3 files easy to download and
exchange. The format is now so popular that it gives the name to a new class
of products, i.e., the mp3 players

AAC Digital Audio Coding:

 The acronym AAC stands for Advanced Audio Coding, and the corresponding
encoding technique is considered the natural successor of mp3. The
structures of mp3 and AAC are similar, but the latter improves some of the
algorithms included in the different layers.
AAC contains two major improvements with respect to mp3.
 The first is the higher adaptivity with respect to the characteristics of the
audio. Different analysis windows are used when the incoming sound has
frequencies concentrated in a narrow interval or when strong components are
separated by more than 220 Hz. The result is that the perceptual coding gain is
maximized, i.e., most of the bits are allocated for perceptually relevant sound
parts.
 The second improvement is the use of a predictor for the quantized spectrum.
Some audio signals are relatively stationary, so the same spectrum can be
used for subsequent analysis frames. When several contiguous frames use the
same spectrum, it must be encoded only the first time and, as a consequence,
the bit-rate is reduced. The predictor decides in advance whether
the next frame requires computing a new spectrum or not.
In order to serve different needs, AAC provides different profiles of decreasing
complexity:
 The main profile offers the highest quality; the low-complexity profile
does not include the predictor; and the sampling-rate-scalable profile has
the lowest complexity.
 The main profile AAC has shown higher performance than the other formats in
several comparisons:
 At a bit-rate of 128 kb/sec, listeners cannot distinguish between original
and coded stereo sound.
 If the bit-rate is decreased to 96 kb/sec, AAC has a quality higher than
mp3 at 128 kb/sec.
 If both AAC and mp3 have a bit-rate of 128 kb/sec, AAC shows a
significantly superior performance.

Perceptual Coding:
 The main issue in perceptual coding is the identification of the frequencies that
must be coded to preserve perceptual quality or, conversely, of the frequencies
that can be discarded and for which no bits need to be allocated.
 The selection, in both above senses, is based on three psychoacoustic
phenomena:
➢ The existence of critical bands,
➢ The absolute threshold of hearing (TOH) and
➢ The masking.
 The TOH is defined as the lowest energy that a signal must carry to be heard by
humans (corresponding to an intensity I0 = 10^-12 Watts per square meter). This
suggests, as a first frequency-removal criterion, that any spectral component
with an energy lower than the TOH should not be coded, and that the minimum
audible energy is a function of the frequency f.
 The function Tq(f) is referred to as the absolute TOH, and it enables a
better bit-rate reduction by removing any spectral component with energy E0
< Tq(f0).
 The absolute TOH is plotted in the figure below. The lowest energy values
correspond to frequencies ranging between 50 and 4000 Hz, not surprisingly
those that propagate best through the middle ear.
 The main limit of the Tq(f0) introduced above is that it applies only to pure
tones in noiseless environments, while sounds in everyday life have a more
complex structure. In principle, it is possible to decompose any complex
signal into a sum of waves with a single frequency f0 and to remove those
with energy lower than Tq(f0), but this does not take into account the fact that
the perception of different frequencies is not independent.
 Components with a certain frequency can stop the perception of other
frequencies in the auditory system. Such an effect is called masking and it
modifies significantly the curve in Figure above.
 The waves with a given frequency f excite the auditory nerves in the region
where they reach their maximum amplitude (the nerves are connected to the
cochlea walls). When two waves of similar frequency occur together and their
frequency is around the center of a critical band, the excitation induced by one
of them can prevent from hearing the other. In other words, one of the two
sounds (called masker) masks the other one (called maskee). From an
encoding point of view, this is important because no bits accounting for
maskee frequencies need to be allocated in order to preserve good perceptual
quality.
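The absolute threshold curve described above is often approximated in perceptual-coding work by Terhardt's closed-form expression. The text only plots the curve, so this formula is an assumption:

```python
import math

def absolute_toh_db(f_hz):
    """Terhardt's approximation of the absolute threshold of hearing
    Tq(f) in dB SPL (an assumed closed form; the text only plots the curve)."""
    f = f_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# The threshold is lowest in the 50 Hz - 4 kHz region, the frequencies that
# propagate best through the middle ear, and rises steeply at the extremes.
for f in (100, 1000, 3500, 15000):
    print(f, round(absolute_toh_db(f), 1))
```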

3. Examine the operation of digital cameras.

Solution:

The figure above shows the typical arrangement used to capture and store a digital image
produced by a digital camera. It involves several steps:
1. Capturing the Image:
The first step is to capture the image using the camera's lens and image sensor. The lens
focuses the light onto the image sensor, which captures the image and converts it into digital
data. The image sensor is a 2-D grid of light-sensitive cells known as photosites. Each
photosite stores the level of intensity of the light that falls on it when the camera shutter is
activated. The CCD (Charge-Coupled Device), a widely used image sensor, comprises an
array of photosites on its surface and operates by converting the level of light intensity that
falls on each photosite into an equivalent electrical charge. The level of charge (i.e., light
intensity) stored at each photosite position is read out and then converted into a digital
value using an ADC.
For color images, the color associated with each photosite, and hence with each pixel
position, is obtained in one of 3 methods, as shown in the figure below:

 Method 1: The surface of each photosite is coated with either an R, G, or B filter so
that its charge is determined only by the level of red, green, or blue light that falls on
it. The coatings are arranged in a 3 x 3 grid structure as in the figure above; the color
associated with each photosite/pixel is then determined by the output of that photosite
(R, G, or B) together with those of its 8 immediate neighbors. The levels of the two
other colors at each pixel are estimated by an interpolation procedure involving all 9
values.
Application: most consumer cameras.

 Method 2: This involves the use of 3 separate exposures of a single image sensor: the
first through a red filter, the second through a green filter, and the third through a blue
filter. The color associated with each pixel position is then determined by the charge
obtained with each of the 3 filters, R, G, and B.
Application: used primarily with high-resolution still-image cameras in locations such
as photographic studios, where the camera can be attached to a tripod. It cannot be
used with video cameras, since 3 separate exposures are required for each image.

 Method 3: This uses 3 separate image sensors: one with all photosites coated with a
red filter, the second with a green filter, and the third with a blue filter. A single
exposure is used, with the incoming light split into 3 beams, each of which exposes a
separate image sensor.
Application: professional-quality high-resolution still and moving image cameras; these
are more costly owing to the 3 separate sensors and the associated signal processing
circuits.
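Method 1 can be illustrated with a small sketch. The filter layout below is the common RGGB arrangement, and the plain neighborhood averaging is a simplified stand-in for a camera's actual interpolation procedure:

```python
import numpy as np

# Filter layout: the common RGGB arrangement (even rows R,G,...; odd rows G,B,...).
h, w = 6, 6
raw = np.random.randint(0, 256, (h, w))      # one digitized charge value per photosite
colors = np.empty((h, w), dtype='<U1')
colors[0::2, 0::2] = 'R'; colors[0::2, 1::2] = 'G'
colors[1::2, 0::2] = 'G'; colors[1::2, 1::2] = 'B'

def estimate(y, x, c):
    """Average the photosites carrying color c in the 3 x 3 neighborhood of (y, x)."""
    vals = [raw[j, i]
            for j in range(max(0, y - 1), min(h, y + 2))
            for i in range(max(0, x - 1), min(w, x + 2))
            if colors[j, i] == c]
    return sum(vals) / len(vals)

# Interpolation: a full R, G, B triple at every pixel position.
rgb = np.array([[[estimate(y, x, c) for c in 'RGB'] for x in range(w)]
                for y in range(h)])
print(rgb.shape)  # (6, 6, 3)
```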

The charge stored at each photosite location is read out and digitized when each image/frame
is captured by the image sensor. With a CCD, the charges on the matrix of photosites are
read out into a readout register one row at a time. Once in the readout register, the charge
from each photosite position is shifted out, amplified, and digitized using an ADC.

2. Processing the Image:


The digital data is then processed by the camera's processor to create a final image. The
processor adjusts the exposure, white balance, and other settings to create the best possible
image. The processor also compresses the digital data to reduce the file size of the image. The
final image is a digital representation of the original scene captured by the camera.

3. Storing the Image:


The final step is to store the digital image on a memory card or other storage device. The
memory card is where the digital image is kept until it can be transferred to a computer or
other device. The memory card is removable and can be replaced with a new one when it is
full. The size of the memory card determines how many images can be stored on it. For a
low-resolution image (640 x 480 pixels) with a pixel depth of 24 bits, 8 bits each for R, G,
and B, the amount of memory required to store each image is 640 x 480 x 3 = 921,600 bytes.
A number of file formats are used to store sets of images; one of the most popular is a
version of TIFF called TIFF for electronic photography (TIFF/EP).
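The storage figure above is easy to verify. The 64 MB card at the end is a hypothetical size, used only to show how capacity translates into image count:

```python
# Memory required for one low-resolution image, as computed in the text.
width, height = 640, 480
bits_per_pixel = 24                     # 8 bits each for R, G, and B

bytes_per_image = width * height * bits_per_pixel // 8
print(bytes_per_image)                  # 921600 bytes

# A hypothetical 64 MB card, to show how capacity maps to image count:
card_bytes = 64 * 10**6
print(card_bytes // bytes_per_image)    # 69 uncompressed images
```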

4. Transferring the Image:


Once the image is stored on the memory card, it can be transferred to a computer or other
device for further processing or sharing. This can be done using a USB cable, memory card
reader, or wireless transfer. The digital image can be transferred to a computer for editing or
printing, or it can be shared with others via email, social media, or other means.

5. Editing the Image:


The digital image can be edited using photo editing software to adjust the color, brightness,
contrast, and other aspects of the image. This allows you to enhance the image and make it
look its best. Editing software can be used to crop the image, remove unwanted elements, or
add special effects.

4. A series of messages is to be transferred between two computers over a PSTN.
The messages comprise just the characters 'A' through 'H'. Analysis has shown
that the probability of each character is as follows:

A=2/40=0.05

B=3/40=0.075

C=4/40=0.1

D=5/40=0.125

E=6/40=0.15

F=7/40=0.175

G=8/40=0.2

H=5/40=0.125

Examine Huffman Code tree for prefix property and derive the code words from
the same.

Solution:

The tree is built by repeatedly combining the two entries with the lowest probabilities
and re-sorting the list in decreasing order of probability:

Step 1:              G-0.2, F-0.175, E-0.15, D-0.125, H-0.125, C-0.1, B-0.075, A-0.05
Step 2 (A+B):        G-0.2, F-0.175, E-0.15, D-0.125, H-0.125, AB-0.125, C-0.1
Step 3 (C+AB):       ABC-0.225, G-0.2, F-0.175, E-0.15, D-0.125, H-0.125
Step 4 (D+H):        DH-0.250, ABC-0.225, G-0.2, F-0.175, E-0.15
Step 5 (E+F):        EF-0.325, DH-0.250, ABC-0.225, G-0.2
Step 6 (ABC+G):      ABCG-0.425, EF-0.325, DH-0.250
Step 7 (DH+EF):      DHEF-0.575, ABCG-0.425
Step 8 (ABCG+DHEF):  root-1.0

Assigning 0 to each left branch and 1 to each right branch gives the tree:

                          1.0
                    0 /        \ 1
                 0.425          0.575
                0 /   \ 1      0 /   \ 1
             G-0.2   0.225   0.250   0.325
                    0 / \ 1  0 / \ 1 0 / \ 1
                 C-0.1 0.125 H   D   E    F
                      0 / \ 1
                      A     B

Since every character sits at a leaf of the tree, no codeword can be the prefix of
another codeword: the prefix property holds. Reading off the branch labels from the
root to each leaf gives the codewords:

A – 0110

B – 0111

C – 010

D – 101

E – 110

F – 111

G – 00

H – 100
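The construction and the prefix property can also be checked programmatically. This is a generic heap-based Huffman sketch, not tied to the particular tie-breaking used in the table above, so individual codewords may differ from the ones listed while the codeword lengths and the average length agree:

```python
import heapq

# Symbol probabilities from the question, as integer counts out of 40.
freqs = {'A': 2, 'B': 3, 'C': 4, 'D': 5, 'E': 6, 'F': 7, 'G': 8, 'H': 5}

def huffman_codes(freqs):
    """Build a Huffman code; returns {symbol: bitstring}."""
    heap = [(w, i, {s: ''}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)  # lightest subtree -> prefix '0'
        w2, _, c2 = heapq.heappop(heap)  # next lightest    -> prefix '1'
        merged = {s: '0' + b for s, b in c1.items()}
        merged.update({s: '1' + b for s, b in c2.items()})
        heapq.heappush(heap, (w1 + w2, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_codes(freqs)
total = sum(freqs.values())
avg_len = sum(freqs[s] * len(codes[s]) for s in freqs) / total
print(avg_len)  # 2.925 bits per character on average

# Prefix property: no codeword is a prefix of another.
words = list(codes.values())
assert not any(a != b and b.startswith(a) for a in words for b in words)
```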
