
Advanced Digital Signal Processing

Assignment IV

Q. Explain how audio signal processing is used for music applications.

Music signals possess specific acoustic and structural characteristics that distinguish
them from spoken language or other non-musical signals. Music signal processing may
appear to be the junior relation of the large and mature field of speech signal processing,
not least because many techniques and representations originally developed for speech
have been applied to music, often with good results.

Audio signal processing is involved in various music-centric applications, as described
below:

I. Music Information Retrieval Systems: Currently, music classification and searching
depend entirely upon textual metadata (title of the piece, composer, players,
instruments, etc.). Features that can be extracted automatically from recorded music
to describe its musical content would be very useful for music classification and
subsequent retrieval based on various user-specified similarity criteria. The chief
attributes of a piece of music are its timbral texture, pitch content and rhythmic
content. The pitch and rhythm relationships between successive notes make up the
melody of the music. These features can be computed with the help of appropriate
audio processing methods, which we now describe.

A. Timbral texture of music is described by features similar to those used in
speech and speaker recognition. Musical instruments are modelled as
resonators with periodic excitation. The fundamental frequency of the excitation
determines the perceived pitch of the note, while the spectral resonances,
which shape the harmonic spectrum envelope, are characteristic of the instrument
type and shape. MFCCs have been very successful in characterising vocal tract
resonances for vowel recognition, and this has prompted their use in instrument
identification. Means and variances of the first few cepstral coefficients
(excluding the DC coefficient) are utilised to capture the gross shape of the
spectrum. Other useful spectral envelope descriptors are the means and
variances, taken over a texture window, of the spectral centroid, roll-off, flux and
zero-crossing rate.
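
As an illustration, a minimal MATLAB sketch of such frame-level timbral descriptors is given below. It computes the spectral centroid and zero-crossing rate per frame and summarises them over a texture window; the signal x, sampling rate fs and all frame/window sizes are assumptions and not part of the original assignment.

% Minimal sketch (assumed names): frame-level timbral descriptors and their
% statistics over a texture window. Assumes a mono signal x and sampling
% rate fs are in the workspace, e.g. from [x, fs] = audioread('clip.wav');
N = 1024; hop = 512;                          % frame length and hop (samples)
nFrames = floor((length(x) - N)/hop) + 1;
w = 0.5 - 0.5*cos(2*pi*(0:N-1)'/N);           % Hann window built by hand
f = (0:N/2-1)' * fs/N;                        % frequencies of the half spectrum
centroid = zeros(nFrames,1); zcr = zeros(nFrames,1);
for k = 1:nFrames
    seg = x((k-1)*hop + (1:N)); seg = seg(:);
    zcr(k) = sum(abs(diff(sign(seg))))/(2*N);          % zero-crossing rate
    mag = abs(fft(seg.*w)); mag = mag(1:N/2);
    centroid(k) = sum(f.*mag)/(sum(mag) + eps);        % spectral centroid (Hz)
end
texLen = min(nFrames, round(fs/hop));          % roughly a 1 s texture window
features = [mean(centroid(1:texLen)) var(centroid(1:texLen)) ...
            mean(zcr(1:texLen))      var(zcr(1:texLen))];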
B. A natural way for a user to query a music database is to perform the query by
singing or humming a fragment of the melody. Converting such a query into a symbolic
melody representation relies on several audio processing steps. Typically, the melody is
represented as a sequence of discrete-valued note pitches and durations. A
critical component of query signal processing, then, is audio segmentation into
distinct notes. Transitions in short-term energy or in band-level energies
derived from either auditory or acoustic-phonetic motivated frequency bands
are used to detect vocal note onsets. A pitch label is assigned to a detected note
based on suitable averaging of frame-level pitch estimates across the note
duration. Since people remember the melodic contour (or the shape of the
temporal trajectory of pitch) rather than exact pitches, the note pitches and
durations are converted to relative pitch and duration intervals.
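
A minimal MATLAB sketch of this query processing is given below: note onsets are detected from short-term energy, and note pitches/durations are then converted to relative intervals. The query signal q, sampling rate fs, the energy threshold and the example note values are all assumptions for illustration.

% Minimal sketch (assumed names): energy-based note onset detection for a
% hummed query, then conversion of note pitches/durations to relative
% intervals. Assumes a mono query q and sampling rate fs in the workspace.
N = 1024; hop = 256;
nFrames = floor((length(q) - N)/hop) + 1;
E = zeros(nFrames,1);
for k = 1:nFrames
    seg = q((k-1)*hop + (1:N));
    E(k) = sum(seg(:).^2);                     % short-term energy per frame
end
mask  = double(E > 0.1*max(E));                % crude energy threshold
onset = find(diff(mask) == 1) + 1;             % frames where energy rises: note onsets

% Suppose averaging frame-level pitch estimates between onsets gave these
% note pitches (Hz) and durations (s); the values are illustrative only:
notePitchHz = [240 256 300 337.5];
noteDurSec  = [0.50 0.50 0.25 0.75];
pitchIntervals = diff(12*log2(notePitchHz));                     % semitone steps between notes
durIntervals   = log2(noteDurSec(2:end)./noteDurSec(1:end-1));   % log duration ratios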

II. Music Segmentation: One of the fundamental areas of audio processing is
separating or segmenting music into its constituent streams, such as vocals,
background music, etc. The technique that has enjoyed the most widespread success for
this use case is the NMF (Non-Negative Matrix Factorisation) algorithm. Time-domain
audio signals are not suited to this method since they include both positive and
negative values. However, the magnitude of the spectrogram meets the non-
negativity requirement perfectly. A spectrogram S(t, ω) is calculated by dividing the
time-domain signal s(τ) into small frames and performing the Discrete Fourier Transform
(DFT) on each frame: S(t, ω) = DFT[s(τ) γ(τ − t)], where γ(τ − t) is a short time window.
The magnitude of the spectrogram is |S(t, ω)|. The resulting non-negative matrix is
factorised into two non-negative matrices W and H. The elementary spectra are
unmixed and extracted into the columns of the dictionary matrix W, while the
activation matrix H gives the mixing proportions of those dictionary elements
(columns of W) in each time frame. The bases in the W matrix are features in the
frequency domain, which can correspond to notes in certain situations, and the H
matrix records their locations along time. After the factorisation, the vocals (lead)
can be separated from the background (accompaniment) on the basis of clustering
the components.
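
A minimal MATLAB sketch of this factorisation is given below, using hand-written Lee-Seung multiplicative updates so that no toolbox is needed. The mixture signal s, the number of bases r and the analysis parameters are assumptions, not part of the original text.

% Minimal sketch (assumed names, no toolboxes): NMF of a magnitude
% spectrogram via Lee-Seung multiplicative updates. Assumes a mono mixture
% signal s in the workspace.
N = 2048; hop = 512;
w = 0.5 - 0.5*cos(2*pi*(0:N-1)'/N);            % Hann window
nF = floor((length(s) - N)/hop) + 1;
V = zeros(N/2+1, nF);                          % non-negative magnitude spectrogram
for k = 1:nF
    seg = s((k-1)*hop + (1:N));
    X = fft(seg(:).*w);
    V(:,k) = abs(X(1:N/2+1));
end
r = 8;                                         % number of spectral bases
W = rand(size(V,1), r); H = rand(r, nF);       % random non-negative initialisation
for it = 1:200                                 % multiplicative update rules
    H = H .* (W'*V) ./ (W'*W*H + eps);
    W = W .* (V*H') ./ (W*(H*H') + eps);
end
% Columns of W hold the elementary spectra; rows of H give their activations
% over time. Clustering the r components into "lead" and "accompaniment"
% groups and inverting each group's spectrogram yields the separated streams.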

III. Music Production: Composers work with notes, melodies and chord progressions
to produce a song. It would not be a stretch to say that a music producer’s toolbox is
built around another kind of creative tool: audio effects. They are at the core of how
producers shape sound and turn it into music. Audio processing can turn a so-so mix
into a powerful finished track. Common audio processing effects are:

A. Modulation effects: These effects include Chorus, Tremolo, Flanger and Phaser.
For example, for the Chorus effect, the audio processor makes copies of the
original audio signal and applies delay and pitch modulation to those copies.
These copies are then mixed back with the original to add harmonics and give a
‘fuller’ sound. Tremolo, on the other hand, is a modulation effect
created by varying the amplitude (volume) of a signal. It gives a trembling effect
and makes the sound more rhythmic, percussive or stuttering.
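
As an illustration of the tremolo effect just described, a minimal MATLAB sketch follows. The input signal x, sampling rate fs, LFO rate and modulation depth are assumptions.

% Minimal sketch (assumed names): a tremolo effect, i.e. amplitude modulation
% of the signal by a low-frequency oscillator (LFO). Assumes mono x and fs.
rateHz = 5;                                    % LFO rate: how fast the volume wobbles
depth  = 0.6;                                  % modulation depth, 0..1
t   = (0:length(x)-1)'/fs;                     % time axis (s)
lfo = 1 - depth/2 + (depth/2)*sin(2*pi*rateHz*t);   % swings between 1-depth and 1
y   = x(:) .* lfo;                             % amplitude-modulated (tremolo) output
% soundsc(y, fs);                              % uncomment to listen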

B. Time-based effects: These effects include Reverb, Delay and Echo. Reverb is the
result of many closely spaced echoes arriving together, so that they are heard as
one continuous sound. Music composers use audio processing to calculate the
required delays, levels and frequency responses, algorithmically generate multiple
echoes and overlap them into a single sound. Reverb brings sustain to a sound and
adds spaciousness and depth to the signal. Delay and Echo are simply copies of the
original signal with different lags (phases). Delay is also the foundation
for other effects, including Chorus and Reverb.
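
A minimal MATLAB sketch of a feedback delay (echo) is given below. The input x, sampling rate fs, delay time, feedback gain and mix are assumptions.

% Minimal sketch (assumed names): a feedback delay (echo) effect, which is
% also the basic building block of Chorus and Reverb. Assumes mono x and fs.
delaySec = 0.30;                               % delay time (s)
fb  = 0.5;                                     % feedback gain: controls echo decay
mix = 0.6;                                     % wet/dry mix
D   = round(delaySec*fs);
xp  = [x(:); zeros(4*D,1)];                    % pad so the echo tail is not cut off
y   = xp;
for n = D+1:length(y)
    y(n) = xp(n) + fb*y(n-D);                  % each sample picks up a decayed echo
end
out = (1 - mix)*xp + mix*y;                    % blend dry and delayed signals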

C. Spectral effects: These effects include Equalisation and Panning. Equalisation
(EQ) is the cutting or boosting of a particular frequency (or range of
frequencies) in the frequency spectrum of the sound being produced. By cutting
or boosting certain frequencies, EQ shapes the tone and character of the sound
and changes the balance between the frequencies that are already there.
Panning, on the other hand, is the distribution of a sound signal across a stereo
(or multi-channel) field; it works by letting more or less of the signal through
to each speaker, creating various spatial effects. Panning creates the illusion
of a sound source moving from one part of the soundstage to another.
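
A minimal MATLAB sketch of constant-power panning is given below. The mono input x and the pan position are assumptions.

% Minimal sketch (assumed names): constant-power panning of a mono source
% into a stereo field. Assumes mono x in the workspace.
pan   = 0.25;                                  % 0 = hard left, 0.5 = centre, 1 = hard right
theta = pan*pi/2;                              % map pan position onto a quarter circle
L = cos(theta)*x(:);                           % left-channel signal
R = sin(theta)*x(:);                           % right channel; L^2 + R^2 stays constant
stereoOut = [L R];                             % N-by-2 stereo output
% Sweeping 'pan' slowly over time creates the illusion of a moving source.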

D. Dynamic effects: These effects include Compression and Distortion.
Compression is the reduction of dynamic range, i.e. the difference between the
loudest and quietest parts of an audio signal. When compression is applied, the
quieter parts of the signal are boosted and the louder parts are attenuated.
Compressors reduce the gain of the signal and even out the notes that peak,
allowing the composer to bring up the overall volume of the sound without clipping.
Distortion adds harmonics and colours the sound in a pleasant way; it is typically
produced by nonlinear waveshaping (clipping), while digital ‘bit-crusher’ distortion
instead reduces the resolution of the sound by lowering the sampling rate or the
bit depth.
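
A minimal MATLAB sketch of a compressor is given below. It applies sample-by-sample gain reduction above a threshold; the input x, threshold, ratio and make-up gain are assumptions, and attack/release smoothing is omitted for brevity.

% Minimal sketch (assumed names): a static compressor; gain is reduced for
% samples whose level exceeds a threshold. Assumes mono x in [-1, 1].
threshDB = -20;                                % threshold (dBFS)
ratio    = 4;                                  % 4:1 compression ratio
makeupDB = 6;                                  % make-up gain applied afterwards
lvlDB  = 20*log10(abs(x(:)) + eps);            % instantaneous level in dB
gainDB = zeros(size(lvlDB));
over   = lvlDB > threshDB;                     % samples above the threshold
gainDB(over) = threshDB + (lvlDB(over) - threshDB)/ratio - lvlDB(over);   % reduction
y = x(:) .* 10.^((gainDB + makeupDB)/20);      % apply gain reduction plus make-up gain
% A practical compressor smooths gainDB with attack/release filtering first.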

Q. Choose a Raag, note its ‘Swars’, play its composite frequencies on MATLAB for
the ‘Aaroh’ and ‘Avaroh’, plot the waveform and pitch contour. Write down your
conclusions.

The chosen Raag is Bhairav and its specifications are as follows:

Swar             Sa    Komal Re   Ga    Ma      Pa    Komal Dha   Ni    Sa'
Frequency (Hz)   240   256        300   337.5   360   384         450   480

Table 1. Bhairav Raag Frequency Composition
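
The waveforms and pitch contours in Fig. 1 and Fig. 2 can be generated along the following lines. This is a minimal sketch, not the original assignment code; the sampling rate, note duration and plotting details are assumptions.

% Minimal sketch (not the original assignment code): synthesising the Aaroh
% and Avaroh of Raag Bhairav from the Table 1 frequencies and plotting the
% waveform. Sampling rate and note duration are assumptions.
fs  = 8000;                                    % sampling rate (Hz)
f   = [240 256 300 337.5 360 384 450 480];     % swar frequencies from Table 1 (Hz)
dur = 0.5;                                     % duration of each swar (s)
t   = 0:1/fs:dur - 1/fs;
aaroh = []; avaroh = [];
for k = 1:numel(f)
    aaroh  = [aaroh  sin(2*pi*f(k)*t)];        % ascending order
    avaroh = [avaroh sin(2*pi*f(end+1-k)*t)];  % descending order
end
soundsc(aaroh, fs);                            % play the Aaroh
figure; plot((0:numel(aaroh)-1)/fs, aaroh);
xlabel('Time (s)'); ylabel('Amplitude'); title('Bhairav Aaroh waveform');
% A pitch contour can then be obtained e.g. with the Audio Toolbox pitch()
% function, or simply by plotting the known note frequency against time.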

Figure 1. Waveform and Pitch Contour (Aaroh)


Figure 2. Waveform and Pitch Contour (Avaroh)

Conclusions:

1. From the waveform plots in both Fig. 1 and Fig. 2 (Aaroh and Avaroh), the start and
end of the swars can be easily identified.

2. We can also see that the start and end of each individual swar coincide perfectly
with a change in the pitch frequency, which then stays constant throughout the
duration of that swar.

3. When the Raag is played in Aaroh (ascending), the Pitch Contour plot shows an
increasing trend (Fig. 1), and when the Raag is played in Avaroh (descending), the
Pitch Contour plot shows a decreasing trend (Fig. 2), as expected; the two contours
are mirror images of each other.
Q. For the Raag chosen, also plot the power spectrum and verify if the constituent
frequencies can be identified. Write down your conclusions.

Figure 3. Power Spectrum of Aaroh using the Short-Time Fourier Transform (STFT)

Figure 4. Power Spectrum of Avaroh using the Short-Time Fourier Transform (STFT)
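
Power spectra like those in Fig. 3 and Fig. 4 can be computed along the following lines. This is a minimal sketch, not the original assignment code; the signal name sig and the analysis parameters are assumptions.

% Minimal sketch of how STFT power spectra like Figs. 3-4 could be computed.
% Assumes the Aaroh (or Avaroh) signal is in vector sig with sampling rate fs.
N = 1024; hop = 256;
w  = 0.5 - 0.5*cos(2*pi*(0:N-1)'/N);           % Hann window
nF = floor((length(sig) - N)/hop) + 1;
P  = zeros(N/2+1, nF);
for k = 1:nF
    seg = sig((k-1)*hop + (1:N));
    X = fft(seg(:).*w);
    P(:,k) = abs(X(1:N/2+1)).^2;               % power per frequency bin
end
tAxis = ((0:nF-1)*hop + N/2)/fs;               % frame centres (s)
fAxis = (0:N/2)*fs/N;                          % bin frequencies (Hz)
figure; imagesc(tAxis, fAxis/1000, 10*log10(P + eps)); axis xy;
xlabel('Time (s)'); ylabel('Frequency (kHz)'); colorbar;
% With the Signal Processing Toolbox, spectrogram(sig, w, N-hop, N, fs, 'yaxis')
% produces an equivalent plot directly.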
Conclusions:

1. The frequency components carrying the most power in the Raag audio are
shown in warm colours (refer to the colour bars in Fig. 3 and Fig. 4). From this, we
infer that the dominant frequencies in the audio lie between 0.2 kHz and 0.5 kHz,
i.e. between 200 Hz and 500 Hz, for both Aaroh and Avaroh, which is indeed the
case for Bhairav Raag (refer to Table 1).

2. More importantly, the power spectrum obtained with the STFT shows a time-frequency
breakup of the Raag audio, and from it we can clearly make out the increasing (Aaroh)
and decreasing (Avaroh) trend of the dominant frequency along the time axis, which
agrees with the pitch contour plots (refer to Fig. 1 and Fig. 2).

3. The STFT power spectrum also shows that the audio contains some frequencies up to
5 kHz (although these are not dominant). This can be explained by the fact that, in the
code, we record the Raag playback from a speaker, so some stray frequencies may also
have been picked up due to the non-ideal frequency response of the speaker, leading to
the higher, albeit non-dominant, frequencies observed in Fig. 3 and Fig. 4.

Q. How would you use signal processing for speech compression for mobile
phones, as well as speech transmission for mobile phones?

Traditionally, quantisation techniques such as μ-law and A-law companding were used
to lower the bits per second sent over a communication network. However, owing to
their inherent rigidity, these compression techniques fell out of favour against newer
LPC (Linear Predictive Coding) based methods such as ADPCM, which gained rapid
acceptance due to their ability to adapt to each speaker individually and to increase
the compression ratio by exploiting the high correlation between consecutive samples
of speech. ADPCM and its variants were still waveform encoders.
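
For reference, a minimal MATLAB sketch of μ-law companding, the classical technique mentioned above, is given below. The speech vector x is an assumption.

% Minimal sketch of mu-law companding. Assumes a speech vector x normalised
% to [-1, 1].
mu = 255;                                      % standard value for 8-bit mu-law
y  = sign(x) .* log(1 + mu*abs(x)) / log(1 + mu);    % compress before quantising
yq = round((y + 1)/2 * 255)/255 * 2 - 1;       % uniform quantisation to 256 levels
xr = sign(yq) .* ((1 + mu).^abs(yq) - 1) / mu; % expand at the receiver
% The logarithmic mapping gives quiet samples finer effective quantisation steps
% than loud ones, which is what made 8-bit transmission acceptable for speech.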
More recently, however, a new class of speech compression (speech coding) methods
took shape which, instead of encoding the speech waveform, aims to code and
transmit only the minimal amount of information necessary to synthesise speech that
is perceived as accurate by the listener. This is achieved by modelling aspects of the
speech generation process, effectively modelling the vocal tract. Two components are
generally found in CELP and its variants: one models the long-term speech structure
(the pitch), and the other models the short-term speech structure (the formants).
However, CELP, with its advanced techniques such as speech modelling and vector
quantisation, had the disadvantage of being slow and computationally expensive
compared to ADPCM.

Consequently, the latest in speech compression and coding, used by applications such
as Skype, WhatsApp and the PlayStation 4, is a technique called CELT. CELT stands for
Constrained Energy Lapped Transform and, unlike CELP, offers low latency while giving
equally high compression. Moreover, CELT is open source and completely royalty-free,
and it now forms part of the Opus codec. It is suitable for both speech and music. It is
based on the Modified Discrete Cosine Transform (MDCT) and borrows ideas from the
CELP algorithm, but avoids some of its limitations by operating exclusively in the
frequency domain.

The initial PCM-coded signal is processed in relatively small, overlapping blocks for the
MDCT and transformed into frequency coefficients. Choosing an especially short block
size enables low latency on the one hand, but also leads to poor frequency resolution
that has to be compensated for. To reduce the algorithmic delay further, at the expense
of a minor sacrifice in audio quality, the natural 50% overlap between blocks is
effectively halved by silencing the signal during one eighth of the block length at each
end.
The coefficients are grouped so as to resemble the critical bands of the human auditory
system; the CELT bands approximately follow the Bark scale.
The total energy of each band is analysed, the values are quantised for data reduction,
and they are further compressed through prediction by transmitting only the difference
from the predicted values (delta encoding).
The (unquantised) band energies are then divided out of the raw MDCT coefficients
(normalisation). The coefficients of the resulting residual signal are coded by Pyramid
Vector Quantisation (PVQ).
This encoding leads to code words of fixed (predictable) length, which in turn provides
robustness against bit errors and removes the need for entropy encoding. Finally, all
outputs of the encoder are combined into one bitstream by a range encoder. In connection
with the PVQ, CELT uses a technique known as band folding, which delivers a similar
effect to Spectral Band Replication (SBR) by reusing coefficients of lower bands for
higher ones, but has much less impact on the algorithmic delay and computational
complexity than SBR. The decoder unpacks the individual components from the
range-coded bitstream, multiplies the band energies by the band-shape coefficients and
transforms the result back to PCM data via the inverse MDCT (iMDCT). The individual
blocks are rejoined using weighted overlap-add (WOLA). Many parameters are not
explicitly coded, but are instead reconstructed using the same functions as the encoder.
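
To make the band-energy normalisation and delta-encoding steps concrete, a toy MATLAB illustration is given below. This is not the actual CELT implementation; the band edges, quantisation step and the coefficient vector X are assumptions chosen only to show the idea.

% Toy illustration (not the actual CELT code): grouping transform coefficients
% into bands, normalising out the band energy, and delta-encoding the
% quantised log band energies. Assumes X is a vector of MDCT coefficients
% for one block with at least 128 entries; the band edges are hypothetical.
edges  = [1 5 9 17 33 65 129 length(X)+1];
nBands = numel(edges) - 1;
E  = zeros(nBands,1); Xn = X(:);
for b = 1:nBands
    idx     = edges(b):edges(b+1)-1;
    E(b)    = sqrt(sum(Xn(idx).^2)) + eps;     % energy of the band
    Xn(idx) = Xn(idx)/E(b);                    % unit-energy band "shape" (to be PVQ-coded)
end
logE = round(2*log2(E))/2;                     % coarse quantisation of log energies
dE   = [logE(1); diff(logE)];                  % delta encoding: send only differences
% Decoder side: a cumulative sum recovers the energies, which rescale the shapes.
logEr = cumsum(dE);
Xrec  = Xn;
for b = 1:nBands
    Xrec(edges(b):edges(b+1)-1) = Xn(edges(b):edges(b+1)-1) * 2^logEr(b);
end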
The advantages of CELT that make it a compelling choice for modern speech coding
are as follows:

1. Low algorithmic delay. It allows latencies of typically 3 to 9 ms, and is configurable
to below 2 ms at the price of a higher bitrate to reach similar audio quality.
2. CELT supports mono and stereo audio and is applicable to both speech and music.
3. It has very low computational complexity.
4. It supports both constant and variable bitrate.
5. It is completely open source and royalty-free.

Q. Explain the role of signal processing in Biomedical applications.

Biomedical signals are recordings of the physiological activity of organisms, ranging
from gene and protein sequences, to neural and cardiac rhythms, to tissue and organ
images. Signal processing is then used to derive insights, draw conclusions or, more
recently, to extract features for machine learning algorithms used in various
classification tasks.

Electroencephalograms (EEGs) are becoming increasingly important measurements of
brain activity, and they have great potential for the diagnosis and treatment of mental
and brain diseases and abnormalities.
The cutting edge in EEG applications, however, is the Brain-Computer Interface (BCI).
A BCI is a system that establishes direct communication between the brain and an
external device. The basic setup of a BCI system includes three components:

1. Specific electrodes to record electric, magnetic, or metabolic brain activity.
2. A processing pipeline to interpret those signals, extracting relevant features
from them, decoding patterns of interest using advanced digital processing
techniques and then outputting commands.
3. A computer or external device that operates via the generated commands.

Progress in the field of artificial intelligence has prompted considerable improvements
in how information is processed and decoded from EEG activity. Ultimately, an EEG-
based BCI has to transform the voltage values measured through the electrodes into
digital commands to control the corresponding device. In order to close the loop
between the brain and the device, the BCI requires algorithms for signal
processing, feature extraction, and pattern recognition.

• Signal processing usually involves the use of spectral and spatial filtering to
maximise the signal-to-noise ratio (SNR), as well as procedures to deal with
contamination of the EEG data (i.e., artifacts) and with changes in the characteristics
of the signal across session-to-session recordings (i.e., nonstationarities). After all the
signal processing steps, one usually needs to reduce this high-dimensional
sensor-space data to a feature vector that can be handled by a classifier/
decoder.

• Feature extraction was traditionally based on prior knowledge of human
electrophysiology, although some modern approaches, relying on computers with
high computational power, exploit black-box methodologies to automatically
extract relevant features without any prior assumptions.

• With the feature vectors computed from training data, a classifier/decoder is
trained to learn how to detect brain states that should be associated with
control commands for the device. Once the classifier is trained, it can be used to
evaluate new, unseen data and operate in closed loop.
An end-to-end BCI system chains these three stages together in a closed loop.
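
A minimal, purely illustrative MATLAB sketch of these stages, band-limited spectral feature extraction followed by a simple linear classifier, is given below. All names (trials, y, fs) are hypothetical and not taken from the text: trials{i} is a channels-by-samples EEG trial, y holds +1/-1 labels and fs is the sampling rate.

% Minimal sketch (assumed names): band-power features from each EEG trial and
% a simple linear classifier trained by least squares.
band    = [8 12];                              % frequency band of interest (Hz)
nTrials = numel(trials);
feat    = zeros(nTrials, size(trials{1},1));
for i = 1:nTrials
    X = trials{i};                             % channels x samples
    L = size(X,2);
    F = fft(X, [], 2);                         % spectrum of every channel
    fAxis  = (0:L-1)*fs/L;
    inBand = fAxis >= band(1) & fAxis <= band(2);
    feat(i,:) = sum(abs(F(:,inBand)).^2, 2)';  % band power per channel = feature vector
end
w    = [feat ones(nTrials,1)] \ y(:);          % linear classifier (weights plus bias)
pred = sign([feat ones(nTrials,1)] * w);       % decoded brain state for each trial
% In closed-loop use, each new EEG segment goes through the same filtering and
% feature extraction, and 'pred' is mapped to a command for the external device.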
