Framing: Once the digitized signal is obtained, the next step is framing, i.e., bundling
together a group of consecutive samples. As shown in the figure below, samples 1 to 128
are bundled into one frame, samples 64 to 192 into the next, and so on. Here the
frames overlap.
Frames can be thought of as audio chunks that are long enough to be perceivable. A single
sample at a sampling rate of 44.1 kHz (the CD sampling rate) has a duration of about
0.0227 ms, which is less than the ear's time resolution:
Duration of one sample << ear's time resolution (~10 ms).
To speed up later processing, the frame size is chosen as a power of 2; typical frame
sizes range between 256 and 8192 samples.
Duration of a frame: d_f = K / s_r, where K is the frame size in samples and s_r is the
sampling rate. For example, K = 512 at 44.1 kHz gives 512 / 44100 ≈ 11.6 ms, which is
above the ear's time resolution.
As a result of aggregation, a final value is obtained, which can be a feature vector or a
feature matrix.
Frequency domain feature pipeline:
The extraction pipeline for frequency domain audio features involves several steps to
extract features from the raw audio signal in the frequency domain. Here is an overview
of the typical pipeline:
Analog to digital: The first step in the frequency domain feature extraction pipeline is to
convert the audio from analog to digital format. For this we use an analog-to-digital
converter (ADC): we sample and quantize the analog sound so that we get a digital
signal.
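The sampling-and-quantization step can be sketched in a few lines; the function and parameter names below are illustrative, not from the notes:

```python
import numpy as np

# Sketch of the ADC step: sample a continuous signal, then quantize it.
def adc(analog_fn, duration_s, sample_rate, n_bits):
    """Sample a continuous-time signal and quantize it to 2**n_bits levels."""
    t = np.arange(0, duration_s, 1.0 / sample_rate)   # sampling instants
    x = analog_fn(t)                                  # "analog" values in [-1, 1]
    levels = 2 ** n_bits
    # Quantize: map [-1, 1] onto integer codes 0 .. levels-1.
    codes = np.clip(np.round((x + 1.0) / 2.0 * (levels - 1)), 0, levels - 1)
    return codes.astype(int)

signal = lambda t: np.sin(2 * np.pi * 440.0 * t)      # a 440 Hz tone
samples = adc(signal, duration_s=0.01, sample_rate=44100, n_bits=16)
print(len(samples))   # 441 samples in 10 ms at 44.1 kHz
```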
Framing: Once we have the digitized signal, the next step is framing, i.e., we bundle
together a group of consecutive samples. As we can see in the figure below, we bundle
samples 1 to 128 together, samples 64 to 192 together, and so on.
Here we can see the frames are overlapped.
We can think of frames as audio chunks that are long enough to be perceivable. A single
sample at a sampling rate of 44.1 kHz (the CD sampling rate) has a duration of about
0.0227 ms, which is less than the ear's time resolution:
Duration of one sample << ear's time resolution (~10 ms).
To speed up this process, we use a power-of-2 number of samples. Typical frame
sizes range between 256 and 8192.
Duration of a frame: d_f = K / s_r, where K is the frame size in samples and s_r is the
sampling rate. For example, K = 512 at 44.1 kHz gives 512 / 44100 ≈ 11.6 ms, which is
above the ear's time resolution.
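A quick numeric check of the frame-duration formula, using the typical power-of-2 frame sizes quoted above:

```python
# Frame duration d_f = K / s_r for typical power-of-2 frame sizes K.
sample_rate = 44100                        # Hz, the CD sampling rate
for frame_size in (256, 512, 1024, 8192):
    d_f_ms = frame_size / sample_rate * 1000.0
    print(frame_size, round(d_f_ms, 2))
# Frames of 512 samples or more last longer than the ear's ~10 ms resolution
```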
Spectral leakage happens when taking the Fourier transform of a signal whose endpoints
are discontinuous, because the frame does not contain an integer number of periods.
To avoid this, we apply a windowing function to each frame before we feed the frame into
the Fourier transform. The window tapers the samples at the endpoints towards zero, so
the frame behaves like one period of a periodic signal, which minimizes spectral leakage.
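The leakage effect and the benefit of windowing can be sketched numerically; the 10.5 Hz tone and the Hann window below are illustrative choices, not taken from the notes:

```python
import numpy as np

# A frame holding a non-integer number of periods has discontinuous endpoints.
sr = 1024                              # sampling rate (Hz), chosen for convenience
t = np.arange(sr) / sr                 # one second of samples
x = np.sin(2 * np.pi * 10.5 * t)       # 10.5 Hz: 10.5 periods per frame

rect_spec = np.abs(np.fft.rfft(x))                   # no window (rectangular)
hann_spec = np.abs(np.fft.rfft(x * np.hanning(sr)))  # Hann-windowed frame

# Energy far from the 10.5 Hz peak: leakage is far smaller with the window.
print(rect_spec[100:].max() > hann_spec[100:].max())
```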
However, after windowing, when we combine the frames back together we have lost
information from the signal, since the samples at the endpoints were attenuated.
This issue is resolved by overlapping frames. For overlapping, the frame size and the
hop length are the important parameters.
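The overlapped framing described above can be sketched as follows; the frame size and hop length mirror the figure, and the function name is illustrative:

```python
import numpy as np

# Slice a signal into overlapping frames; frame_size and hop_length are the
# two parameters the notes mention.
def frame_signal(x, frame_size, hop_length):
    """Return an array of overlapping frames, one per row."""
    n_frames = 1 + (len(x) - frame_size) // hop_length
    return np.stack([x[i * hop_length : i * hop_length + frame_size]
                     for i in range(n_frames)])

x = np.arange(1, 257)                            # a toy signal of 256 samples
frames = frame_signal(x, frame_size=128, hop_length=64)
print(frames.shape)                  # (3, 128): 50% overlap between frames
print(frames[0][0], frames[1][0])    # frame 1 starts at sample 1, frame 2 at 65
```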
As a result of aggregation, we get a final value, which can be a feature vector or a feature matrix.
2. Define audio encoding and explain the different audio encoding
formats and their evolution.
Solution:
Audio Encoding:
The domain of techniques capable of reducing the bit-rate while still preserving
a good perceptual quality is called audio encoding.
The different audio encoding formats are:
Linear PCM and Compact Discs
MPEG Digital Audio Coding
AC-3 Digital Audio Coding
Perceptual Coding
The first is that the MPEG architecture is organized in Layers containing sets
of algorithms of increasing complexity. The table below shows the bit-rates
achieved at each layer and the corresponding compression rates with respect to
the 16-bit linear PCM:
Layer I – 384 kb/sec – compression rate 4
Layer II – 192 kb/sec – compression rate 8
Layer III – 128 kb/sec – compression rate 12
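The compression rates in the table can be checked with a short calculation, assuming the reference is 16-bit linear PCM for a stereo signal sampled at 48 kHz (this reference rate is an assumption on my part; CD audio at 44.1 kHz would give ratios of roughly 3.7, 7.4, and 11):

```python
# Bit-rate of the assumed reference: 16-bit stereo linear PCM at 48 kHz.
pcm_rate_kbps = 2 * 48000 * 16 / 1000          # = 1536 kb/s
for layer, rate_kbps in (("I", 384), ("II", 192), ("III", 128)):
    print(layer, pcm_rate_kbps / rate_kbps)    # compression rates 4, 8, 12
```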
The second important change is the application of an analysis and synthesis
approach, implemented in layers I and II. This consists in representing the
incoming signal with a set of compact parameters, in this case sound
frequencies, which are extracted in the encoding phase and used to
reconstruct the signal in the following decoding step. An average MOS of 4.7
and 4.8 has been reported for monaural layer I and II codecs operating at 192
and 128 kbits/sec, respectively.
The third major novelty is the application of psychoacoustic principles capable
of identifying and discarding perceptually irrelevant frequencies in the signal.
Such an approach is called perceptual coding and, since part of the original
signal is removed, the encoding approach is said to be lossy.
The application of the psychoacoustic principles is performed at layer III, and
it reduces the bit-rate of linear PCM by a factor of 12 while achieving an
average MOS between 3.1 and 3.7.
The MPEG layer III is commonly called mp3, and it is used extensively on the
web because of its high compression rate. In fact, the good tradeoff between
perceptual quality and size makes mp3 files easy to download and
exchange. The format is now so popular that it has given its name to a whole
class of products, i.e., the mp3 players.
Perceptual Coding:
The main issue in perceptual coding is the identification of the frequencies that
must be coded to preserve perceptual quality or, conversely, of the frequencies
that can be discarded and for which no bits must be allocated.
The selection, in both above senses, is based on three psychoacoustic
phenomena:
➢ The existence of critical bands,
➢ The absolute threshold of hearing (TOH) and
➢ The masking.
The TOH is defined as the lowest energy that a signal must carry to be heard
by humans (corresponding to an intensity I0 = 10^-12 Watts per square meter).
Since the minimum audible energy is a function of the frequency f, this
suggests as a first frequency removal criterion that any spectral component
with an energy lower than the TOH should not be coded.
The function Tq(f) is referred to as the absolute TOH, and it enables a better
bit-rate reduction: any spectral component at frequency f0 with energy
E0 < Tq(f0) can be removed.
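The notes do not give an explicit expression for Tq(f). A commonly used analytic approximation, due to Terhardt and used in MPEG psychoacoustic models, can be sketched as follows; the formula is an assumption here, not taken from the notes:

```python
import math

# Terhardt's approximation of the absolute threshold of hearing, in dB SPL.
# (Assumed formula -- the notes only describe Tq(f) qualitatively.)
def tq_db(f_hz):
    f = f_hz / 1000.0   # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# The threshold is lowest (hearing is most sensitive) around 3-4 kHz,
# the frequencies that propagate best through the middle ear.
print(tq_db(100), tq_db(3000), tq_db(15000))
```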
The absolute TOH is plotted in the figure below; the lowest energy values
correspond to frequencies ranging between 50 and 4000 Hz, not surprisingly
those that propagate best through the middle ear.
The main limit of the Tq(f) introduced above is that it applies only to pure
tones in noiseless environments, while sounds in everyday life have a more
complex structure. In principle, it is possible to decompose any complex
signal into a sum of waves with a single frequency f0 and to remove those
with energy lower than Tq(f0), but this does not take into account the fact that
the perception of different frequencies is not independent.
Components with a certain frequency can stop the perception of other
frequencies in the auditory system. Such an effect is called masking and it
modifies significantly the curve in Figure above.
The waves with a given frequency f excite the auditory nerves in the region
where they reach their maximum amplitude (the nerves are connected to the
cochlea walls). When two waves of similar frequency occur together and their
frequencies fall around the center of the same critical band, the excitation
induced by one of them can prevent the other from being heard. In other words,
one of the two sounds (called the masker) masks the other (called the maskee). From an
encoding point of view, this is important because no bits accounting for
maskee frequencies need to be allocated in order to preserve good perceptual
quality.
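The encoding consequence can be sketched in a few lines; the component energies and the flat threshold below are made up for illustration, not derived from a real psychoacoustic model:

```python
import numpy as np

# Toy perceptual-coding step: spectral components whose energy falls below
# a per-frequency threshold get no bits at all (they are zeroed out here).
spectrum = np.array([0.5, 12.0, 0.2, 7.0, 0.1])   # component energies
threshold = np.array([1.0, 1.0, 1.0, 1.0, 1.0])   # illustrative Tq values
kept = np.where(spectrum >= threshold, spectrum, 0.0)
print(kept)    # only the audible components survive
```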
Solution:
The figure above shows the typical arrangement used to capture and store a digital image
produced by a digital camera. It involves several steps:
1. Capturing the Image:
The first step is to capture the image using the camera's lens and image sensor. The lens
focuses the light onto the image sensor, which captures the image and converts it into digital
data. The image sensor is a 2-D grid of light-sensitive cells known as photosites. Each
photosite stores the level of intensity of the light that falls on it when the camera shutter is
activated. The CCD (Charge-Coupled Device), a widely used image sensor, comprises an
array of photosites on its surface and operates by converting the level of light intensity that
falls on each photosite into an equivalent electrical charge. The level of charge (i.e., light
intensity) stored at each photosite position is then read out and converted into a digital
value using an ADC.
For color images, the color associated with each photosite, and hence with each pixel
position, is obtained by one of 3 methods, as shown in the figure below:
Method 1: The surface of each photosite is coated with either an R, G, or B filter, so
that its charge is determined only by the level of red, green, or blue light that falls on
it. The coatings are arranged in a 3 x 3 grid structure as in the figure above; the color
associated with each photosite/pixel is then determined by the output of that photosite
together with those of its 8 immediate neighbors. The levels of the two other colors at
each pixel are estimated by an interpolation procedure involving all 9 values.
Application: most consumer cameras.
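A minimal sketch of the interpolation idea: each pixel records one color directly and estimates the other two by averaging the neighbors that carry them. The filter layout and function name are illustrative, not the arrangement in the figure:

```python
import numpy as np

# Toy demosaicing over a 3x3 neighborhood: estimate a missing color at the
# center by averaging the neighbors coated with that color's filter.
filters = np.array([["G", "R", "G"],
                    ["B", "G", "B"],
                    ["G", "R", "G"]])
charges = np.array([[10.0, 200.0, 12.0],
                    [90.0,  14.0, 88.0],
                    [11.0, 210.0, 13.0]])

def estimate(color):
    """Average the charges of the photosites whose filter matches `color`."""
    return charges[filters == color].mean()

# The center photosite measured G directly; R and B are interpolated.
print(estimate("R"), estimate("B"))
```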
Method 3: It uses 3 separate image sensors: one with all photosites coated with a red
filter, a second with a green filter, and a third with a blue filter. A single exposure is
used, with the incoming light split into 3 beams, each of which exposes a separate
image sensor. Application: professional-quality, high-resolution still and moving
image cameras, as they are more costly owing to the use of the 3 separate sensors and
the associated signal processing circuits.
The charge stored at each photosite location is read out and digitized as each image/frame
is captured in the image sensor. With a CCD, the set of charges on the matrix of
photosites is read out a single row at a time, i.e., the photosites are read on a row-by-row
basis into a readout register. Once in the readout register, the charge at each photosite
position is shifted out, amplified, and digitized using an ADC.
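The row-by-row readout can be sketched structurally as follows; function names and the tiny sensor values are illustrative:

```python
import numpy as np

# Sketch of CCD row-by-row readout (structure only, no electronics).
def ccd_readout(photosites):
    """Shift charges out one row at a time through a readout register."""
    digital = []
    for row in photosites:            # rows are transferred one at a time
        register = list(row)          # the row enters the readout register
        while register:               # charges are shifted out one by one,
            charge = register.pop(0)  # then amplified and digitized (ADC)
            digital.append(int(round(charge)))
    return digital

sensor = np.array([[0.2, 1.7], [3.1, 2.4]])   # a tiny 2x2 sensor
print(ccd_readout(sensor))    # [0, 2, 3, 2]
```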
A=2/40=0.05
B=3/40=0.075
C=4/40=0.1
D=5/40=0.125
E=6/40=0.15
F=7/40=0.175
G=8/40=0.2
H=5/40=0.125
Examine the Huffman code tree for the prefix property and derive the code words
from it.
Solution:
[Huffman code tree: the root (weight 1.0) splits into a subtree of weight 0.425 on
branch 0, containing G, C, A, and B, and a subtree of weight 0.575 on branch 1,
containing H, D, E, and F. At every internal node the left branch is labeled 0 and the
right branch 1, and each leaf holds a symbol with its probability (e.g., A – 0.05,
B – 0.075, C – 0.1).]
A – 0110
B – 0111
C – 010
D – 101
E – 110
F – 111
G – 00
H – 100
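The prefix property and the expected code length can also be checked programmatically. The sketch below rebuilds a Huffman code from the given probabilities; tie-breaking may yield different but equally optimal codewords, so only the average length and the prefix property are verified:

```python
import heapq
from itertools import count

# Rebuild a Huffman code for the given symbol probabilities.
probs = {"A": 0.05, "B": 0.075, "C": 0.1, "D": 0.125,
         "E": 0.15, "F": 0.175, "G": 0.2, "H": 0.125}

tiebreak = count()  # makes heap entries comparable when probabilities tie
heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
heapq.heapify(heap)
while len(heap) > 1:
    p0, _, c0 = heapq.heappop(heap)   # two least probable subtrees
    p1, _, c1 = heapq.heappop(heap)
    merged = {s: "0" + w for s, w in c0.items()}
    merged.update({s: "1" + w for s, w in c1.items()})
    heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
codes = heap[0][2]

# Prefix property: no codeword is a prefix of another codeword.
words = list(codes.values())
prefix_free = all(not w2.startswith(w1)
                  for w1 in words for w2 in words if w1 != w2)
avg_len = sum(probs[s] * len(codes[s]) for s in probs)
print(prefix_free, round(avg_len, 3))   # True 2.925
```

The average length of 2.925 bits/symbol matches the codeword lengths listed above (two 4-bit, five 3-bit, and one 2-bit codeword).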